Flapping issue...

Posted: Mon Apr 27, 2015 11:47 am
by PhilG
Hello:
We are using Nagios XI 2014R2.6, and a Host, which was starting to report Critical Status Information in the past couple of weeks, was running NSClient++ 0.3.9 64 bit. I installed the "newer" 0.4.1.105 64 bit version at this time to see if that would clear the issue.
Note: I confirmed with my colleagues, who don't work with the server, that they have not upgraded/made any changes to that Host/server. I have not done any work on that Host/server until now when I installed the newer NSClient++.

The Memory Usage, the CPU Usage, the Drive C: Disk Usage, and the IIS Web Server checks appear to flap and give the following Status Information message:
“CRITICAL - Socket timeout after 10 seconds”.
The Critical clears after a bit, with the following:
1). Memory Usage - “Memory usage: total:32592.11 MB - used: 2519.97 MB (8%) - free: 30072.13 MB (92%)” <-- which is within the threshold checks of warn on 90% and go critical on 95%
2). IIS Web Server - “W3SVC: Started” (the Windows server web service)
3). CPU Usage - “CPU Load 0% (5 min average)”. <-- I'm sure that this was a particular point in time reference when I looked.
4). Drive C: Disk Usage - C:\ - total: 68.33 Gb - used: 51.48 Gb (75%) - free 16.85 Gb (25%) <-- which is within the threshold checks of warn on 90% and go critical on 95%.

This doesn’t happen with the other Hosts and Services that are set up and reporting with Nagios XI (just over 300 Windows and Linux servers/Hosts and over 1300 total Services being monitored).

Suggestions??

Thank you in advance.


UPDATE: These are the checks used:
Drive C: Disk Usage:
https://<my_web_site_URL>/nagiosxi/includes/components/xicore/status.php?show=servicedetail&host=<my_Server_FQDN>&service=Drive+C%3A+Disk+Usage&dest=auto
Status Information: "CRITICAL - Socket timeout after 10 seconds"
check_xi_service_nsclient!<edited>!USEDDISKSPACE!-l C -w 90 -c 95


IIS Web Server:
https://<my_web_site_URL>/nagiosxi/includes/components/xicore/status.php?show=servicedetail&host=<my_Server_FQDN>&service=IIS+Web+Server&dest=auto
Status Information: "CRITICAL - Socket timeout after 10 seconds"
check_xi_service_nsclient!<edited>!SERVICESTATE!-l W3SVC -d SHOWALL


Memory Usage:
https://<my_web_site_URL>/nagiosxi/includes/components/xicore/status.php?show=servicedetail&host=<my_Server_FQDN>&service=Memory+Usage&dest=auto
Status Information: "could not fetch information from server"
check_xi_service_nsclient!<edited>!MEMUSE!-w 90 -c 95

Re: Flapping issue...

Posted: Mon Apr 27, 2015 1:00 pm
by lmiltchev
Let's rule out firewall issues. Run the following command and show us the output:

nmap <client ip> -p 12489
Also, show the output of:

time /usr/local/nagios/libexec/check_nt -H <client ip> -s <password> -p 12489 -v USEDDISKSPACE -l C -w 90 -c 95

Re: Flapping issue...

Posted: Thu Apr 30, 2015 4:23 pm
by PhilG
Sorry for the delay.

Here are the results of your requests:
1). Response from "nmap <client ip> -p 12489":

Starting Nmap 5.51 ( http://nmap.org ) at 2015-04-30 16:19 CDT
Nmap scan report for <Server FQDN> (Server_IP)
Host is up (0.00067s latency).
PORT      STATE SERVICE
12489/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.09 seconds


2). Response from "time /usr/local/nagios/libexec/check_nt -H <client ip> -s <password> -p 12489 -v USEDDISKSPACE -l C -w 90 -c 95":

C:\ - total: 68.33 Gb - used: 51.49 Gb (75%) - free 16.84 Gb (25%) | 'C:\ Used Space'=51.49Gb;61.50;64.91;0.00;68.33

real 0m0.004s
user 0m0.002s
sys 0m0.001s
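
Editor's sketch, not part of the original posts: a single successful run like the one above does not rule out an intermittent timeout, since the services only flap some of the time. One way to catch it is to repeat the same check for a while and count the non-OK exits. The function name `repeat_check`, the run count, and the delay are illustrative assumptions; the `check_nt` path and arguments in the comment are the ones already used in this thread.

```shell
#!/bin/sh
# Run a check command repeatedly and count how often it exits non-zero.
# Hypothetical helper, not part of Nagios or NSClient++.
# Usage: repeat_check COUNT DELAY_SECONDS CMD [ARGS...]
# Note: any non-zero plugin exit (WARNING=1, CRITICAL=2, UNKNOWN=3)
# is counted, not only socket timeouts.
repeat_check() {
    count=$1
    delay=$2
    shift 2
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "$(date '+%H:%M:%S') run $i: non-OK exit"
        fi
        i=$((i + 1))
        # Pause between runs, but not after the last one.
        if [ "$i" -le "$count" ]; then
            sleep "$delay"
        fi
    done
    echo "failures=$failures/$count"
}

# Example (fill in the placeholders before running), 60 runs, 5 s apart:
# repeat_check 60 5 /usr/local/nagios/libexec/check_nt -H <client ip> -s <password> -p 12489 -v USEDDISKSPACE -l C -w 90 -c 95
```

If failures only show up in bursts, the timestamps can be compared against the times the services went CRITICAL in Nagios XI.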

Re: Flapping issue...

Posted: Thu Apr 30, 2015 4:57 pm
by jdalrymple
It sounds like a funky network issue - like maybe some arp poisoning (multiple machines using 1 IP) or some such - that almost can't be the case though if you're not getting host alerts. What is your host check command and interval?

Re: Flapping issue...

Posted: Mon May 04, 2015 2:01 pm
by PhilG
jdalrymple wrote:It sounds like a funky network issue - like maybe some arp poisoning (multiple machines using 1 IP) or some such - that almost can't be the case though if you're not getting host alerts. What is your host check command and interval?
"Monitor the host with this command" - this is not configured/has a blank field. Same goes with many of my other Hosts, but those work with no issues.

Re: Flapping issue...

Posted: Mon May 04, 2015 2:29 pm
by jdalrymple
PhilG wrote:"Monitor the host with this command" - this is not configured/has a blank field.
Inherited from template no doubt. I suggest you read up on templates and inheritance if you care to understand how that works.
PhilG wrote:Same goes with many of my other Hosts, but those work with no issues.
Are you saying that you are getting host alerts for the hosts with problematic services?

Re: Flapping issue...

Posted: Mon May 04, 2015 3:12 pm
by PhilG
jdalrymple wrote:
PhilG wrote:"Monitor the host with this command" - this is not configured/has a blank field.
Are you saying that you are getting host alerts for the hosts with problematic services?
Well, it is a Microsoft Windows IIS server. Need I say any more? ;)

Re: Flapping issue...

Posted: Mon May 04, 2015 3:54 pm
by jdalrymple
I'm sorry, I'm not quite following. Does that mean that you are indeed also getting host alerts in addition to your service alerts?

Re: Flapping issue...

Posted: Mon May 04, 2015 4:15 pm
by PhilG
jdalrymple wrote:I'm sorry, I'm not quite following. Does that mean that you are indeed also getting host alerts in addition to your service alerts?

Host is fine. I find it odd that this is the only Host that is getting
"CRITICAL - Socket timeout after 10 seconds"
on CPU Usage, C: disk partition, IIS Web Server, and whatever else it monitors. Sometimes it's all services, sometimes it's not.

Re: Flapping issue...

Posted: Mon May 04, 2015 4:25 pm
by jdalrymple
Based upon what you've said the only suggestions I would have are to either revert to the older nsclient++ version (or maybe upgrade to a newer one), or get over to http://forums.nsclient.org and see if there are any suggestions by the developer.

The fact that there are no host alerts almost entirely narrows the problem down to the scope of the nsclient service, unless of course you've fiddled with your host command. Since you weren't aware that it was coming down from a template I'd guess that's almost gotta be a "no".