Flapping issue...
Hello,
We are using Nagios XI 2014R2.6. One Host, which started reporting Critical Status Information over the past couple of weeks, was running NSClient++ 0.3.9 (64-bit). I have now installed the newer 0.4.1.105 (64-bit) version to see whether that clears the issue.
Note: I confirmed with my colleagues, who don't work with the server, that they have not upgraded or made any changes to that Host/server. I had not done any work on that Host/server myself until I installed the newer NSClient++.
The Memory Usage, CPU Usage, Drive C: Disk Usage, and IIS Web Server checks flap, reporting the following Status Information:
“CRITICAL - Socket timeout after 10 seconds”
The CRITICAL state clears after a while, and the checks then return:
1) Memory Usage - “Memory usage: total:32592.11 MB - used: 2519.97 MB (8%) - free: 30072.13 MB (92%)” <-- well within the thresholds (warning at 90%, critical at 95%).
2) IIS Web Server - “W3SVC: Started” (the Windows web server service).
3) CPU Usage - “CPU Load 0% (5 min average)” <-- I'm sure this was just a point-in-time reading when I looked.
4) Drive C: Disk Usage - “C:\ - total: 68.33 Gb - used: 51.48 Gb (75%) - free 16.85 Gb (25%)” <-- also well within the thresholds (warning at 90%, critical at 95%).
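To make the "within the thresholds" comparison explicit, here is a small Python sketch that recomputes the used percentages from the raw figures quoted above and classifies them against the 90% warning and 95% critical thresholds. The helper names `used_percent` and `classify` are hypothetical; this only approximates the comparison the plugin makes, and check_nt's own rounding and boundary handling may differ.

```python
def used_percent(used: float, total: float) -> float:
    """Return the used share of a resource as a percentage."""
    return used / total * 100.0


def classify(percent: float, warn: float = 90.0, crit: float = 95.0) -> str:
    """Map a usage percentage to a Nagios-style state (hypothetical helper)."""
    if percent >= crit:
        return "CRITICAL"
    if percent >= warn:
        return "WARNING"
    return "OK"


# Figures copied from the recovered status output above.
memory = used_percent(2519.97, 32592.11)  # roughly 7.7%, shown as 8%
disk_c = used_percent(51.48, 68.33)       # roughly 75.3%, shown as 75%

for name, pct in (("Memory Usage", memory), ("Drive C:", disk_c)):
    print(f"{name}: {pct:.1f}% -> {classify(pct)}")
```

Both values sit far below 90%, which supports the idea that the CRITICAL states come from the socket timeout rather than from the thresholds.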
This doesn’t happen with any of the other Hosts and Services monitored by Nagios XI (just over 300 Windows and Linux servers/Hosts, and over 1300 Services in total).
Any suggestions?
Thank you in advance.
UPDATE: These are the checks used:
Drive C: Disk Usage:
https://<my_web_site_URL>/nagiosxi/includes/components/xicore/status.php?show=servicedetail&host=<my_Server_FQDN>&service=Drive+C%3A+Disk+Usage&dest=auto
Status Information: "CRITICAL - Socket timeout after 10 seconds"
check_xi_service_nsclient!<edited>!USEDDISKSPACE!-l C -w 90 -c 95
IIS Web Server:
https://<my_web_site_URL>/nagiosxi/includes/components/xicore/status.php?show=servicedetail&host=<my_Server_FQDN>&service=IIS+Web+Server&dest=auto
Status Information: "CRITICAL - Socket timeout after 10 seconds"
check_xi_service_nsclient!<edited>!SERVICESTATE!-l W3SVC -d SHOWALL
Memory Usage:
https://<my_web_site_URL>/nagiosxi/includes/components/xicore/status.php?show=servicedetail&host=<my_Server_FQDN>&service=Memory+Usage&dest=auto
Status Information: "could not fetch information from server"
check_xi_service_nsclient!<edited>!MEMUSE!-w 90 -c 95
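Each service check above is a `!`-separated command name plus arguments. The Python sketch below shows how such a string could expand into a check_nt call, assuming check_xi_service_nsclient is defined roughly as `$USER1$/check_nt -H $HOSTADDRESS$ -s "$ARG1$" -p 12489 -v $ARG2$ $ARG3$`. That definition is an assumption (it matches the manual command suggested later in this thread); check the actual command definition in your XI configuration. `expand_check` and the 192.0.2.10 address are hypothetical placeholders.

```python
import shlex

# ASSUMPTION: check_xi_service_nsclient expands roughly to
#   $USER1$/check_nt -H $HOSTADDRESS$ -s "$ARG1$" -p 12489 -v $ARG2$ $ARG3$
# Verify against the real command definition before relying on this.
CHECK_NT = "/usr/local/nagios/libexec/check_nt"
NSCLIENT_PORT = "12489"


def expand_check(service_check: str, host_address: str) -> list[str]:
    """Turn 'command!ARG1!ARG2!ARG3' into a check_nt argument list."""
    command, *args = service_check.split("!")
    if command != "check_xi_service_nsclient" or len(args) != 3:
        raise ValueError(f"unexpected check: {service_check!r}")
    password, variable, extra = args
    return [CHECK_NT, "-H", host_address, "-s", password,
            "-p", NSCLIENT_PORT, "-v", variable, *shlex.split(extra)]


argv = expand_check(
    "check_xi_service_nsclient!<edited>!USEDDISKSPACE!-l C -w 90 -c 95",
    "192.0.2.10",  # placeholder documentation address, not the real host
)
print(shlex.join(argv))
```

Under that assumption, all three services talk to the same NSClient++ listener on port 12489, which is consistent with them timing out together.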
Re: Flapping issue...
Let's rule out firewall issues. Run the following command and show us the output:
Code:
nmap <client ip> -p 12489
Also, show the output of:
Code:
time /usr/local/nagios/libexec/check_nt -H <client ip> -s <password> -p 12489 -v USEDDISKSPACE -l C -w 90 -c 95
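A single nmap run only shows that port 12489 was open at that instant. Because the timeouts are intermittent, it can help to probe the port repeatedly and count failures. The following Python sketch does that with plain TCP connects; `probe` is a hypothetical helper, not part of Nagios or NSClient++, and a successful connect does not prove that NSClient++ answers the request in time.

```python
import socket
import time


def probe(host: str, port: int, attempts: int = 10,
          timeout: float = 10.0, pause: float = 1.0) -> dict[str, int]:
    """Open and close a TCP connection repeatedly; count outcomes."""
    counts = {"ok": 0, "timeout": 0, "error": 0}
    for i in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                counts["ok"] += 1
        except socket.timeout:
            # Must be caught before OSError, since it is a subclass of it.
            counts["timeout"] += 1
        except OSError:
            # Refused, unreachable, reset, and similar failures.
            counts["error"] += 1
        if i < attempts - 1:
            time.sleep(pause)
    return counts
```

For example, `probe("<client ip>", 12489, attempts=60, pause=5.0)` would try once every five seconds for about five minutes; any non-zero "timeout" count would point at the network or the listener rather than at the check thresholds.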
Re: Flapping issue...
Sorry for the delay.
Here are the results of your requests:
1). Response from "nmap <client ip> -p 12489":
Starting Nmap 5.51 ( http://nmap.org ) at 2015-04-30 16:19 CDT
Nmap scan report for <Server FQDN> (Server_IP)
Host is up (0.00067s latency).
PORT STATE SERVICE
12489/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.09 seconds
2). Response from "time /usr/local/nagios/libexec/check_nt -H <client ip> -s <password> -p 12489 -v USEDDISKSPACE -l C -w 90 -c 95":
C:\ - total: 68.33 Gb - used: 51.49 Gb (75%) - free 16.84 Gb (25%) | 'C:\ Used Space'=51.49Gb;61.50;64.91;0.00;68.33
real 0m0.004s
user 0m0.002s
sys 0m0.001s
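The performance data in that output also shows how the percentage thresholds are applied: the warning and critical levels are reported in gigabytes, and 61.50 and 64.91 are 90% and 95% of the 68.33 GB total. A minimal Python sketch, assuming the standard Nagios plugin perfdata layout `'label'=value[UOM];warn;crit;min;max` with all five numeric fields present; `parse_perfdata` is a hypothetical helper.

```python
import re

PERFDATA = "'C:\\ Used Space'=51.49Gb;61.50;64.91;0.00;68.33"

# 'label'=value[UOM];warn;crit;min;max (assumes every field is filled in)
PATTERN = re.compile(
    r"'(?P<label>[^']+)'=(?P<value>[\d.]+)(?P<uom>[A-Za-z%]*)"
    r";(?P<warn>[\d.]+);(?P<crit>[\d.]+);(?P<min>[\d.]+);(?P<max>[\d.]+)"
)


def parse_perfdata(text: str) -> dict:
    """Parse a single quoted-label perfdata item into its fields."""
    match = PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not perfdata: {text!r}")
    fields = match.groupdict()
    numeric = {k: float(fields[k]) for k in ("value", "warn", "crit", "min", "max")}
    return {"label": fields["label"], "uom": fields["uom"], **numeric}


data = parse_perfdata(PERFDATA)
warn_pct = data["warn"] / data["max"] * 100  # close to 90%
crit_pct = data["crit"] / data["max"] * 100  # close to 95%
print(f"warn at {warn_pct:.1f}%, crit at {crit_pct:.1f}%")
```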
jdalrymple
Re: Flapping issue...
It sounds like an odd network issue, maybe some ARP poisoning (multiple machines using one IP) or something similar, but that almost can't be the case if you're not getting host alerts. What are your host check command and check interval?
Re: Flapping issue...
jdalrymple wrote: It sounds like a funky network issue - like maybe some arp poisoning (multiple machines using 1 IP) or some such - that almost can't be the case though if you're not getting host alerts. What is your host check command and interval?
"Monitor the host with this command" is not configured (the field is blank). The same goes for many of my other Hosts, but those work with no issues.
jdalrymple
Re: Flapping issue...
PhilG wrote: "Monitor the host with this command" - this is not configured/has a blank field.
It's inherited from a template, no doubt. I suggest you read up on templates and inheritance if you want to understand how that works.
PhilG wrote: Same goes with many of my other Hosts, but those work with no issues.
Are you saying that you are getting host alerts for the hosts with problematic services?
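Nagios object definitions inherit directives from the templates named in their `use` directive, so a host whose own "Monitor the host with this command" field is blank can still end up with a check command. The Python sketch below is a simplified model of that lookup (single-valued directives only, and the first template that defines a directive, directly or through its own templates, wins). The template names, host name, and the `check-host-alive` and interval values are hypothetical, and Nagios has additional rules (such as additive `+` values and `null`) that are not modeled here.

```python
from typing import Optional

# Hypothetical object definitions; real ones live in the Nagios config.
TEMPLATES = {
    "generic-host": {"check_command": "check-host-alive",
                     "check_interval": "5"},
    "windows-server": {"use": "generic-host"},
}

HOST = {"host_name": "iis01", "use": "windows-server", "check_command": ""}


def resolve(obj: dict, directive: str) -> Optional[str]:
    """Return a directive's value, falling back to templates named in 'use'."""
    value = obj.get(directive)
    if value:  # in this model a blank value counts as "not set locally"
        return value
    for name in obj.get("use", "").split(","):
        name = name.strip()
        if name in TEMPLATES:
            inherited = resolve(TEMPLATES[name], directive)
            if inherited is not None:
                return inherited
    return None


print(resolve(HOST, "check_command"))
print(resolve(HOST, "check_interval"))
```

In this model the blank host-level field simply defers to the template chain, which is why the host can still be actively checked even though its own field looks empty.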
Re: Flapping issue...
jdalrymple wrote: Are you saying that you are getting host alerts for the hosts with problematic services?
Well, it is a Microsoft Windows IIS server. Need I say any more?
jdalrymple
Re: Flapping issue...
I'm sorry, I'm not quite following. Does that mean that you are indeed also getting host alerts in addition to your service alerts?
Re: Flapping issue...
jdalrymple wrote: I'm sorry, I'm not quite following. Does that mean that you are indeed also getting host alerts in addition to your service alerts?
The Host is fine. I find it odd that this is the only Host that gets
"CRITICAL - Socket timeout after 10 seconds"
on CPU Usage, the C: disk partition, the IIS Web server, and whatever else it monitors. Sometimes it's all of the services, sometimes it's not.
jdalrymple
Re: Flapping issue...
Based on what you've said, the only suggestions I have are to either revert to the older NSClient++ version (or perhaps upgrade to a newer one), or head over to http://forums.nsclient.org and see whether the developer has any suggestions.
The fact that there are no host alerts narrows the problem down almost entirely to the NSClient++ service, unless of course you've changed your host check command. Since you weren't aware that it was coming from a template, I'd guess that's almost certainly a "no".
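To test the NSClient++ side directly, one option is to run the same check_nt command many times from the Nagios server and log how long each run takes and how it exits, so slow or failed responses show up even when a single manual run completes instantly. `run_repeatedly` is a hypothetical Python helper; the run count, pause, and hard limit are assumptions to adjust. By Nagios plugin convention, exit codes are 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.

```python
import subprocess
import time


def run_repeatedly(argv: list[str], runs: int = 10, pause: float = 0.0,
                   hard_limit: float = 60.0) -> list[tuple[float, str]]:
    """Run a plugin command several times; record duration and outcome."""
    results = []
    for i in range(runs):
        start = time.monotonic()
        try:
            proc = subprocess.run(argv, capture_output=True, text=True,
                                  timeout=hard_limit)
            # Nagios plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
            outcome = f"exit {proc.returncode}: {proc.stdout.strip()}"
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this
            outcome = "killed after hard limit"
        results.append((time.monotonic() - start, outcome))
        if i < runs - 1:
            time.sleep(pause)
    return results

# Example (not executed here; fill in the real address and password):
# run_repeatedly(["/usr/local/nagios/libexec/check_nt", "-H", "<client ip>",
#                 "-s", "<password>", "-p", "12489", "-t", "30",
#                 "-v", "USEDDISKSPACE", "-l", "C", "-w", "90", "-c", "95"],
#                runs=120, pause=30.0)
```

The commented example raises check_nt's `-t` socket timeout (10 seconds by default) to 30 seconds, which can help distinguish responses that are merely slow from ones that never arrive.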