Re: Flapping issue...

Posted: Tue May 05, 2015 8:54 am
by PhilG
jdalrymple wrote:Based upon what you've said the only suggestions I would have are to either revert to the older nsclient++ version (or maybe upgrade to a newer one), or get over to http://forums.nsclient.org and see if there are any suggestions by the developer.

The fact that there are no host alerts almost entirely narrows the problem down to the scope of the nsclient service, unless of course you've fiddled with your host command. Since you weren't aware that it was coming down from a template I'd guess that's almost gotta be a "no".

I'll uninstall the NSClient++ and install the older client this time.

It IS possible that the website this server hosts, which was written by a developer, has some rogue processes/threads that are causing performance issues, which may in turn affect NSClient++.

A couple of years ago we had a consultant come in and run tests on a different environment belonging to this developer. The developer claimed that it was NOT their code causing the performance issues, and the consultant's report said that neither the network nor the server setup was the issue.

I'll reply back when I have a chance to do that.

Re: Flapping issue...

Posted: Tue May 05, 2015 9:41 am
by lmiltchev
PhilG wrote:I'll reply back when I have a chance to do that.
Let us know how things are working with the older version of NSClient++ whenever you have a chance.

Re: Flapping issue...

Posted: Tue May 05, 2015 12:33 pm
by PhilG
lmiltchev wrote:
I'll reply back when I have a chance to do that.
Let us know how things are working with the older version of NSClient++ whenever you have a chance.
Okay, I'll provide my steps and the results, so stay with me on this.

1). We are running Nagios XI 2014R2.6 on the Nagios XI monitoring server (I know that 2014R2.7 has recently been released).

2). The server that was reporting an issue is a physical server running Windows 2008 R2 64 bit.

3). I have uninstalled the NSClient++ 0.4.1.105-x64 (I have this client installed on a few other Windows servers and they have no issue) on the "problem" server, and then installed NSClient++ 0.3.9-x64 (the client that is linked within Nagios XI and the Windows Server and is installed on the majority of our Windows 2003 and 2008 servers).

4). The important NSClient++ "NSC.INI" file entries are:
[modules]
NSClientListener.dll
CheckWMI.dll
FileLogger.dll
CheckSystem.dll
CheckDisk.dll
CheckEventLog.dll
CheckHelpers.dll

use_file=1
allowed_hosts=<IP Address of Nagios XI server>
password=<correct password>

*POSSIBLE ISSUE HERE* This is the default and is configured on all the other Windows servers, in the same VLAN, which are not experiencing issues
;# NSCLIENT PORT NUMBER
; This is the port the NSClientListener.dll will listen to.
;port=12489

[External Alias]
alias_cpu=checkCPU warn=80 crit=90 time=5m time=1m time=30s <---- This concerns me since I configure the Nagios XI Windows Server Wizard with warn at 90 and crit at 95. This is on all the other Windows servers, too.
alias_cpu_ex=checkCPU warn=$ARG1$ crit=$ARG2$ time=5m time=1m time=30s

alias_mem=checkMem MaxWarn=80% MaxCrit=90% ShowAll=long type=physical type=virtual type=paged type=page <---- This concerns me since I configure the Nagios XI Windows Server Wizard with warn at 90 and crit at 95. This is on all the other Windows servers, too.

alias_up=checkUpTime MinWarn=1d MinWarn=1h


[NRPE Client Handlers]
check_other=-H <some default IP address> -p 5666 -c remote_command -a arguments


5). The following are the Nagios XI Windows Server Wizard checks for the "problem server" (they are manually set up the same way on all the other Windows servers):
check_xi_service_nsclient!nagi0sadm1n!UPTIME
check_xi_service_nsclient!nagi0sadm1n!CPULOAD!-l 5,90,95
check_xi_service_nsclient!nagi0sadm1n!USEDDISKSPACE!-l C -w 90 -c 95
***The next check identifies a difference between the default client check and the Wizard check that I configure***
check_xi_service_nsclient!nagi0sadm1n!MEMUSE!-w 90 -c 95


6). I did a little research and came across the following:
a). "http://geekpeek.net/socket-timeout-afte ... ds-nagios/"
b). "http://support.nagios.com/forum/viewtop ... =7&t=24924"

7). From the Nagios XI server console, I ran NMAP against the "problem" server:
a).
Starting Nmap 5.51 ( http://nmap.org ) at 2015-05-05 11:03 CDT
Nmap scan report for <FQDN of server> (<IP Address of server>)
Host is up (0.00068s latency).
Not shown: 986 closed ports
PORT STATE SERVICE
80/tcp open http
135/tcp open msrpc
139/tcp open netbios-ssn
445/tcp open microsoft-ds
1025/tcp open NFS-or-IIS
1026/tcp open LSA-or-nterm
1027/tcp open IIS
1028/tcp open unknown
2301/tcp open compaqdiag
2381/tcp open compaq-https
3389/tcp open ms-term-serv
8400/tcp open cvd
8402/tcp open abarsd
8600/tcp open asterix

b). Then I ran the modified version: nmap -p 5666,12489:
PORT STATE SERVICE
5666/tcp closed nrpe
12489/tcp open unknown

8). I modified the following in the NSC.INI on the "problem" server, then restarted the NSClient++ service:
a). Uncommented the port line:
;# NSCLIENT PORT NUMBER
; This is the port the NSClientListener.dll will listen to.
port=12489

b). Modified the warning and critical levels:
[External Alias]
alias_cpu=checkCPU warn=90 crit=95 time=5m time=1m time=30s
alias_mem=checkMem MaxWarn=90% MaxCrit=95% ShowAll=long type=physical type=virtual type=paged type=page

9). I have verified that the "problem" server's Windows Firewall was turned off, but I noted that the NSClient++ was not listed as an exception (like I noted in a couple of other Windows servers), just in case.

10). I have deleted the previous "problem" server's Services and Host, applied the configuration, then added the "problem" server back into Nagios XI using the Windows Server Wizard, and have been monitoring the "problem" server for a couple of hours.

11). The four Services that report "CRITICAL - Socket timeout after 10 seconds" are "Uptime", "IIS Web Server", "Memory Usage", and "Drive C: Disk Usage". The "Uptime" service has been extremely flaky (flapping), but it appears to clear up after a scheduled forced immediate check.

12). System Uptime reports 569 days!! How can I clear this - a server reboot??
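
Steps 7 and 11 together suggest the port is open but only intermittently slow to answer, which a single nmap run cannot show. One option is to time repeated TCP connects to port 12489 from the Nagios XI server. The following is a minimal sketch in Python, not something from the thread: the host name `problem-server.example.com` is a placeholder, `probe` and `summarize` are hypothetical helper names, and the 10-second timeout is chosen only to mirror the "Socket timeout after 10 seconds" message.

```python
import socket
import time

def probe(host, port, timeout=10.0):
    """Attempt one TCP connect; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:  # refused, timed out, unreachable, or name lookup failed
        ok = False
    return ok, time.monotonic() - start

def summarize(results):
    """Reduce a list of (succeeded, elapsed_seconds) pairs to a small report."""
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max((secs for _, secs in results), default=0.0)
    return {"attempts": len(results), "failures": failures, "slowest": slowest}

if __name__ == "__main__":
    HOST = "problem-server.example.com"  # placeholder, not from the thread
    PORT = 12489                         # NSClientListener port from NSC.INI
    results = []
    for _ in range(30):
        results.append(probe(HOST, PORT))
        time.sleep(2)
    print(summarize(results))
```

A connect that succeeds only proves the listener accepted the socket. It does not exercise the NSClient protocol itself, so a clean run here would not rule out a slow check on the Windows side.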

Re: Flapping issue...

Posted: Tue May 05, 2015 2:01 pm
by jdalrymple
Anything useful in the Windows event log regarding the NSCP service? Is it perchance crashing then respawning?

Re: Flapping issue...

Posted: Tue May 05, 2015 2:40 pm
by PhilG
jdalrymple wrote:Anything useful in the Windows event log regarding the NSCP service? Is it perchance crashing then respawning?

I checked the "problem" server's Windows Event Viewer, but I am sorry to report that "NSCP" and "NSClient" were not found in any event log.

The Nagios XI monitoring server's "/usr/local/nagios/var/nagios.log" confirms the error (I suspect this file is where the GUI console gets its information).

The "problem" server's "NSClient.log" did report this back at 11:09:11 am and 11:09:21 am:
"error:modules\NSClientListener\NSClientListener.cpp:276: Read on socket failed: recv returned SOCKET_ERROR: 10054: An existing connection was forcibly closed by the remote host."

That's about it.
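
Winsock error 10054 is WSAECONNRESET, meaning the other end reset the connection. To see how often this happens over a longer period, the log could be scanned for these entries. The following is a rough sketch that assumes every relevant line contains "SOCKET_ERROR: <code>" as in the line quoted above. The rest of the log format is not known from this thread, and `count_socket_errors` is a hypothetical helper name.

```python
import re
import sys
from collections import Counter

# A few standard Winsock error codes and their symbolic names.
WINSOCK_NAMES = {
    10054: "WSAECONNRESET",
    10060: "WSAETIMEDOUT",
    10061: "WSAECONNREFUSED",
}

SOCKET_ERROR_RE = re.compile(r"SOCKET_ERROR: (\d+)")

def count_socket_errors(lines):
    """Count log lines per Winsock error, keyed by name (or by number if unmapped)."""
    counts = Counter()
    for line in lines:
        match = SOCKET_ERROR_RE.search(line)
        if match:
            code = int(match.group(1))
            counts[WINSOCK_NAMES.get(code, str(code))] += 1
    return counts

if __name__ == "__main__":
    # Usage: pass the path to a copy of NSClient.log as the only argument.
    with open(sys.argv[1], errors="replace") as handle:
        print(count_socket_errors(handle))
```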

Re: Flapping issue...

Posted: Tue May 05, 2015 3:29 pm
by jdalrymple
I'm about out of ideas. Have you posted this on the nsclient forums? That may be useful...

If it were my machine the next troubleshooting step I'd take would be to remove all of the existing check_nt services and create one that just does "-v CLIENTVERSION" to see if that simple little check can stay stable:

[jdalrymple@localhost libexec]$ ./check_nt -H windowsserver -s password -p 12489 -v CLIENTVERSION
NSClient++ 0.4.2.114 2015-01-08
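
To judge whether that one simple check "stays stable" without waiting on the scheduler, it could be run in a loop from the Nagios XI server and the results tallied. The following is a minimal sketch, assuming the plugin follows the usual Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The names `run_check` and `tally` are hypothetical, and the command to run is taken from the script's own arguments.

```python
import subprocess
import sys
import time
from collections import Counter

STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def run_check(argv, timeout=15):
    """Run one plugin command; return (state_name, first_output_line)."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "UNKNOWN", "plugin did not finish in time"
    except OSError as exc:  # e.g. the plugin path is wrong
        return "UNKNOWN", str(exc)
    lines = done.stdout.strip().splitlines()
    return STATES.get(done.returncode, "UNKNOWN"), (lines[0] if lines else "")

def tally(argv, runs=20, pause=5.0):
    """Run the same check repeatedly and count how often each state occurs."""
    counts = Counter()
    for _ in range(runs):
        state, output = run_check(argv)
        counts[state] += 1
        print(state, output)
        time.sleep(pause)
    return counts

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: pass the full plugin command line as arguments")
    print(tally(sys.argv[1:]))
```

Saved under a name of your choosing, it would be invoked with the same check_nt command line shown above appended as arguments. Any CRITICAL results in the tally would point at the client or the network rather than at the individual checks.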

Re: Flapping issue...

Posted: Tue May 05, 2015 3:51 pm
by PhilG
jdalrymple wrote:I'm about out of ideas. Have you posted this on the nsclient forums? That may be useful...

If it were my machine the next troubleshooting step I'd take would be to remove all of the existing check_nt services and create one that just does "-v CLIENTVERSION" to see if that simple little check can stay stable:

[jdalrymple@localhost libexec]$ ./check_nt -H windowsserver -s password -p 12489 -v CLIENTVERSION
NSClient++ 0.4.2.114 2015-01-08
I wonder if this is the problem: This "problem" server was set up a long time ago by a colleague who retired last month. I used NSLOOKUP and found that the server FQDN, <server name>.<subdomain name>.<domain name>, is associated with THREE IP addresses in our DNS!! One IP is part of our DEV environment in one VLAN, one IP is associated with our TEST environment in a second VLAN, and the third IP is associated with our PROD environment (the one we need to monitor) in another VLAN.
DEV uses IP x.y.A.z
Test uses IP x.y.B.z
Prod uses IP x.y.C.z

The "A", "B", and "C" subnets are configured within their respective VLANs.

Is Nagios XI trying to check the FQDN DNS name that was configured during the Windows Server Wizard checks but then gets confuzzled when the IP returns differently during the check, thus causing those timeout errors?

Re: Flapping issue...

Posted: Tue May 05, 2015 4:04 pm
by jdalrymple
Fix that for sure - I expected some "awkward" networking issue like that from the beginning...

That said, Nagios' behavior is that it uses whatever is in the address field in the host definition. If you put a hostname in there it will resolve it as often as the system needs to. If you put an IP in there, DNS doesn't come into play. One surefire way to find out is to go back into CCM and adjust the host to use an IP instead of a hostname. If you're a DNS-centric organization then this isn't ideal, but then again neither is having zombie DNS entries...
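
To confirm from the Nagios XI server what the name actually resolves to, something like the following could be run there. It is a minimal sketch, not a Nagios feature: `resolve_ipv4` and `check_address` are hypothetical helper names, and the host name and expected PROD address in the example are placeholders you would replace with your own.

```python
import socket

def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the local resolver gives."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

def check_address(addresses, expected):
    """Classify a resolution result against the one address we mean to monitor."""
    if expected not in addresses:
        return "missing"    # the name never resolves to the monitored address
    if len(addresses) > 1:
        return "ambiguous"  # the name can resolve to other addresses as well
    return "ok"

if __name__ == "__main__":
    HOSTNAME = "server.sub.example.com"  # placeholder FQDN
    EXPECTED = "192.0.2.10"              # placeholder PROD address
    found = resolve_ipv4(HOSTNAME)
    print(found, check_address(found, EXPECTED))
```

An "ambiguous" result would match the three-record situation described above, where any given lookup may hand back the DEV or TEST address instead of PROD.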

Re: Flapping issue...

Posted: Tue May 05, 2015 4:15 pm
by PhilG
jdalrymple wrote:Fix that for sure - I expected some "awkward" networking issue like that from the beginning...

That said, Nagios' behavior is that it uses whatever is in the address field in the host definition. If you put a hostname in there it will resolve it as often as the system needs to. If you put an IP in there, DNS doesn't come into play. One surefire way to find out is to go back into CCM and adjust the host to use an IP instead of a hostname. If you're a DNS-centric organization then this isn't ideal, but then again neither is having zombie DNS entries...
I'm sure there was a reason why they did this a few years ago. I can contact the stakeholder and see if they are doing any Dev and/or Test work; if they are not, I can remove those DNS entries. Otherwise, I will have to rely on your suggestion.

Please do not freeze this post yet.

Thanks.

Re: Flapping issue...

Posted: Wed May 06, 2015 9:19 am
by tmcdonald
We'll keep this open until we hear back from you.