Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Thu Nov 30, 2017 1:39 pm
by emssa
Hello,
We are seeing random WARNING - deviceXXX: rta 1549.852ms, lost 80%
We are also getting HTTP/HTTPS down/critical alerts even though we can still see the page. These are usually followed by a 302 error and then a recovery after N seconds, even though nothing has changed with a redirect.
Are you aware of latency issues across the board with Juniper devices? Has
anyone else brought this up in the community?
What concerns me with this is that Nagios warnings are based on false
positive readings of our devices per our network.
https://kb.juniper.net/InfoCenter/index ... id=KB28157
According to the article, the differences should be in the tens, but we are seeing differences of hundreds and thousands.
Is this correct?
Thanks,
Brad
Re: Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Thu Nov 30, 2017 5:00 pm
by dwhitfield
I am not aware of any Juniper-specific issues. Just to make sure, I asked around the office and nothing is known (doesn't mean there's not a new issue, of course). That said, there are some Juniper-specific plugins if ICMP isn't going to work well for a Juniper device:
https://exchange.nagios.org//directory/ ... sh/details
Are the HTTP issues with the web interface for the switch?
I'd be interested in seeing a traceroute to both the switch and the web server (if they aren't the same thing). It's possible these are separate issues, but if the switch is having issues and you need that to get to the webserver, that would start to make sense. If the desktop doesn't need the switch, but XI does, things start to make even more sense.
Most likely, if the issue were with XI you'd see plugin timeouts, but just to be sure, can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.
You can also generate a profile manually using the script at /usr/local/nagiosxi/html/includes/components/profile/getprofile.sh
That should generate a profile in /usr/local/nagiosxi/var/components/ which you can get off the server with an application such as FileZilla.
After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.
If you get an error that PROFILE BUILD FAILED, please see
https://support.nagios.com/kb/article.p ... ategory=44
Re: Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Fri Dec 01, 2017 1:53 pm
by emssa
These are false positives caused by how Juniper switches use CoS, which gives ICMP a lower priority for devices plugged into the Juniper switches. We do not actually manage the switches, nor do we have access to them.
The only issue I see with XI is that it runs on a server plugged into a Juniper switch and is therefore subject to the CoS priority. I am sure that if we had another XI instance, it would complain and send alerts that the other was down or had latency issues, similar to what we are seeing with all of our devices such as PDUs, APCs, SANs, and servers (iDRAC/iLO).
The key point is that we can never be sure whether an alert or warning is a false positive caused by this Juniper CoS feature, which breaks any device or service that relies on proper latency gathering from the network.
Re: Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Fri Dec 01, 2017 1:59 pm
by dwhitfield
I think I understand most of the issue now, but I'm still a bit confused about the HTTP server. Were you testing a web server using ping? Your hosts don't have to use ping. They can use whatever check you want. In fact, you don't even have to set host checks at all if you just want to use services.
Re: Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Fri Dec 01, 2017 2:45 pm
by emssa
Yeah, but we want to know when a host is having true latency or ping problems. How else will we know when network issues arise?
This also affects things like SAN performance balancing and HA clusters with Corosync and STONITH or Pacemaker, which actively decide thresholds and actions based on returned latency.
So we really do not know what status our network or attached devices are in with that feature.
Plus we are getting a bunch of alerts that are false positives.
Just wanted to reach out to see if anyone else was seeing this and if there was a workaround other than going to another switch vendor.
Re: Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Fri Dec 01, 2017 3:00 pm
by dwhitfield
It certainly doesn't matter to us if you stick with Junos or not, but you don't need to use icmp ping unless that's a business requirement. You could give this a shot: yum install tcping
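For anyone weighing this suggestion, the idea behind a TCP-based check is to time a TCP handshake instead of an ICMP echo. The following is a minimal Python sketch of that idea, not the tcping tool itself. The function name `tcp_connect_ms` and the default timeout are assumptions made for illustration, and the measured time also includes name resolution if a hostname is passed.

```python
import socket
import time
from typing import Optional


def tcp_connect_ms(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the time in milliseconds to complete a TCP connect, or None on failure."""
    start = time.perf_counter()
    try:
        # create_connection performs the full TCP handshake; no ICMP is involved
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return None
```

Calling `tcp_connect_ms("172.26.241.185", 80)` would return the connect time in milliseconds if something is listening on that port, or `None` otherwise. The address and port here are placeholders only.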
Re: Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Mon Dec 04, 2017 12:08 pm
by emssa
dwhitfield wrote:It certainly doesn't matter to us if you stick with Junos or not, but you don't need to use icmp ping unless that's a business requirement. You could give this a shot: yum install tcping
Gave it a shot, and we are still seeing alerts for services that are supposedly down but are not.
Thanks though
Re: Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Mon Dec 04, 2017 12:32 pm
by dwhitfield
Can you show me the host config for the host you changed to stop using icmp? Also, can you send the nagios.log for the day in which it is failing?
Re: Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Mon Dec 04, 2017 12:43 pm
by emssa
Was I supposed to add an "ignore all" setting for IPv4 to sysctl? Is that what you mean by host config?
Re: Latency Uptime Warnings/Alerts and Juniper Switches
Posted: Mon Dec 04, 2017 1:11 pm
by dwhitfield
It's not clear if you meant you tried tcping from the command line or set it up as a host command. I suppose if you are seeing issues from the command line, then this is pretty clearly an issue with the network in some regard.
What I mean when I say host config is something like the following:
Code:
define host {
    host_name           172.26.241.185
    alias               Nagios
    address             172.26.241.185
    # ARG1 is the warning threshold and ARG2 the critical threshold (average round trip in ms, packet loss in %)
    check_command       check_ping!200.0,20%!400.0,90%!!!!!!
    max_check_attempts  1
    check_interval      5
    retry_interval      3
    register            1
}
You can just PM me another profile, if that's easier. The nagios.log in the profile is only a tail, so that's why I didn't ask for that initially. I want to make sure I am seeing the right statuses in the nagios.log. Also, sometimes permissions mean the nagios.log doesn't come in the profile and I really want to see the logs associated with one of the hosts in question.
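As a reading aid for the check_command line in the example host definition, here is a small Python sketch of how ping thresholds of that form are commonly evaluated. It assumes that either metric (average round trip or packet loss) reaching its threshold is enough to change the state. The function `ping_state` and the second set of thresholds are hypothetical and only there for comparison; this is not the plugin's actual code.

```python
from typing import Tuple

Threshold = Tuple[float, float]  # (average round trip in ms, packet loss in %)


def ping_state(rta_ms: float, loss_pct: float, warn: Threshold, crit: Threshold) -> str:
    """Classify a ping result against warning and critical thresholds."""
    # Either metric reaching its critical threshold is enough for CRITICAL
    if rta_ms >= crit[0] or loss_pct >= crit[1]:
        return "CRITICAL"
    # Otherwise either metric reaching its warning threshold gives WARNING
    if rta_ms >= warn[0] or loss_pct >= warn[1]:
        return "WARNING"
    return "OK"


# Thresholds from the example host definition: 200.0,20% and 400.0,90%
print(ping_state(1549.852, 80.0, warn=(200.0, 20.0), crit=(400.0, 90.0)))  # CRITICAL
# Hypothetical looser thresholds, for comparison only
print(ping_state(1549.852, 80.0, warn=(3000.0, 80.0), crit=(5000.0, 100.0)))  # WARNING
```

Under these assumptions, the reading from the first post (rta 1549.852ms, lost 80%) would be critical with the example thresholds, so the state reported for a given host depends heavily on which thresholds its check actually uses.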