Services dropping off
Posted: Tue Jun 08, 2010 9:28 am
by dxf1
I have now deployed a live VM at a datacentre, and it appears to perform well. I have one issue: periodically the odd service shows a critical condition when in reality there is no issue.
I have set the VM to have a fixed IP address:
[root@localhost network-scripts]# cat ifcfg-eth0
# Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE]
DEVICE=eth0
BOOTPROTO=none
HWADDR=00:0c:29:66:94:f8
IPADDR=10.228.100.41
NETMASK=255.255.255.192
GATEWAY=10.228.100.62
Type=Ethernet
IPV6INIT=no
ONBOOT=yes
USERCTL=no
PEERDNS=yes
ETHTOOL_OPTS="speed 100 duplex full autoneg on"
These false positives on some services are preventing the VM from going live.
Regards, Dave
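As a quick sanity check on the addressing in the ifcfg-eth0 file above, the sketch below uses Python's standard ipaddress module to confirm that the gateway falls inside the subnet implied by IPADDR and NETMASK. The function name gateway_on_link is made up for illustration. It checks addressing only and says nothing about duplex settings or latency inside the VM.

```python
import ipaddress

# Values copied from the ifcfg-eth0 file in the post.
IPADDR = "10.228.100.41"
NETMASK = "255.255.255.192"
GATEWAY = "10.228.100.62"


def gateway_on_link(ipaddr, netmask, gateway):
    """Return True if the gateway is a usable host address in the interface's subnet."""
    network = ipaddress.ip_interface("%s/%s" % (ipaddr, netmask)).network
    gw = ipaddress.ip_address(gateway)
    # The network and broadcast addresses cannot be used as a gateway.
    reserved = (network.network_address, network.broadcast_address)
    return gw in network and gw not in reserved


network = ipaddress.ip_interface("%s/%s" % (IPADDR, NETMASK)).network
print(network)                                     # 10.228.100.0/26
print(gateway_on_link(IPADDR, NETMASK, GATEWAY))   # True
```

For the posted values the subnet works out to 10.228.100.0/26 (usable hosts .1 to .62), so the gateway at .62 is on-link and the addressing itself does not look like the cause.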
Re: Services dropping off
Posted: Tue Jun 08, 2010 9:42 am
by mmestnik
Something is causing these. Nagios Core, as used by NagiosXI, is widely deployed and tested, with more than 10 years of history. Though the current version 3.x is fairly new, it's not very likely to be causing your problems.
Could you provide more information about the VM, what it's running on and what software is being used? Could you also describe the network environment related to these outages?
Does this affect all the checks or just a select few? Look for commonalities, such as a switch or router. Are all the network devices used during these checks also being monitored?
There are several points that could be injecting these; you should work toward ruling them out one by one. Nagios Core and the check commands can be rated as nearly impossible causes. The best thing to do is to work everything else up to at least that level of confidence. Then we should try to discover any problems related to NagiosXI.
I've seen some switches from a prominent vendor behave this way, but it's not very likely that you would be using the same brand.
Re: Services dropping off
Posted: Wed Jun 09, 2010 3:07 am
by dxf1
Hi. The NagiosXI VM is running on VMware Player version 3.0.1 build-227600; the host OS is a Red Hat 4.6 32-bit environment.
This is the only thing running on the server at this time. The checks are only against AS400 i5 servers. I do intend to include the routers at a later date, but this is the initial release into live. I am using the check_as400 plugin, which is running fine on a couple of Nagios 2.x systems at our local location.
The reason for having a local server at the customer's site is that the losses on the VPN over the internet are causing false positives and incorrect alerts.
Thanks Dave
Re: Services dropping off
Posted: Wed Jun 09, 2010 9:39 am
by mmestnik
So the checks work from your remote location... but from a connection local to the server they do not?
This is to be expected, as the path to the internet is often better maintained than the hardly ever used paths between servers. Add ping checks for all the intermediary routers; this will help you spot the failing link. I expect you to discover a faulty Ethernet cable or network interface. Keep in mind Cisco and Sun gear, even the 1000, often needs to be pinned on both ends to 100FD.
As indicated, some switches just operate this way, dropping random packets. It may be that your check is SNMP/UDP based and won't do well in those environments. What brand is all of your equipment, and how long (or short!) are the Ethernet runs?
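One way to act on the suggestion of pinging each intermediary hop is sketched below in Python. It assumes a Linux-style ping that accepts -c and prints a "packets transmitted, ... received" summary line. The function names are hypothetical, and any hop list you pass in is your own; nothing here is part of Nagios.

```python
import re
import subprocess

# Matches the summary line printed by common ping implementations, e.g.
# "10 packets transmitted, 9 received, 10% packet loss, time 9012ms".
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_loss(ping_output):
    """Return packet loss in percent from ping's summary line, or None if absent."""
    match = SUMMARY.search(ping_output)
    if match is None:
        return None
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        return None
    return 100.0 * (sent - received) / sent


def ping_loss(host, count=20):
    """Ping one host `count` times and return the measured loss in percent."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return parse_loss(result.stdout)


def report(hops, count=20):
    """Return {hop: loss_percent} for each hop, in the order given."""
    return {hop: ping_loss(hop, count) for hop in hops}
```

Calling report(["10.228.100.62"]) with the gateway from the ifcfg-eth0 file, followed by each router and the AS400 hosts, and running it once from the VM and once from the host, should show roughly where the loss starts.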
Re: Services dropping off
Posted: Thu Jun 10, 2010 2:19 am
by dxf1
Sorry, you misunderstand: the checks at the US data centre are having this issue. I am not aware of any of the logistics of the data centre; as I am in the UK, I can't easily go and see.
What I meant in the earlier post is that I have used this plugin, a Java-based check_as400, without issue in the UK. This is the first time I have had issues where the checks miss intermittently and show false positives.
I believe that the routers and firewalls are all of Cisco origin and are managed by Sungard for us.
Could you please confirm that the config for eth0 in the earlier post is OK? This is within the VM. This is now urgent, as the customer is expecting delivery of this functionality.
Best Regards, Dave
Re: Services dropping off
Posted: Thu Jun 10, 2010 10:18 am
by mmestnik
There is a jack at the back of some device that is connected to an Ethernet cable... no?
You have an issue with the logistics of the data center. I've worked with Sungard before; my co-workers always thought they were a great help. However, I wasn't impressed by remote hands working blindly; streaming video would have been a nice touch.
You are going to need to do some pretty low-level analysis to work out these kinks. The first thing I'd do is try setting your interfaces to 100FD; if you do the server and then the switch, perhaps you won't need remote hands. You should do this any time Cisco gear connects to a non-Cisco device. This is especially true for Sun servers and does affect other equipment.
"The path from your Nagios server to the devices is dropping packets."
Re: Services dropping off
Posted: Tue Jun 22, 2010 2:00 am
by dxf1
Hi,
My issue is that the VM has very high latency. The network from the host is OK, but the latency is very poor (slow) from the VM. The as400 plugin does not have a timeout setting, so I am missing up to 10% of the checks intermittently.
Any thoughts?
Dave
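Since the plugin is described as having no timeout setting of its own, one possible workaround is a small wrapper that enforces one from the outside. The sketch below is an assumption-laden illustration, not part of check_as400 or Nagios: the names run_check and main are made up, and mapping a timeout to UNKNOWN rather than CRITICAL is a design choice intended to keep slow replies from showing up as false criticals.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(command, timeout_seconds):
    """Run a plugin command with a hard time limit; return (exit_code, output)."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return UNKNOWN, "UNKNOWN - check exceeded %g s and was killed" % timeout_seconds
    except OSError as exc:
        return UNKNOWN, "UNKNOWN - could not run check: %s" % exc
    output = result.stdout.strip() or result.stderr.strip()
    code = result.returncode
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        code = UNKNOWN  # e.g. the plugin died on a signal
    return code, output


def main(argv):
    """argv = [TIMEOUT_SECONDS, COMMAND, ARG, ...]; returns the exit code to use."""
    if len(argv) < 2:
        print("usage: TIMEOUT_SECONDS COMMAND [ARGS...]")
        return UNKNOWN
    code, output = run_check(argv[1:], float(argv[0]))
    print(output)
    return code
```

Saved as a script ending in sys.exit(main(sys.argv[1:])), it would sit in front of the real plugin in the Nagios command definition. If the plugin is started through a shell script that launches Java, the Java process may outlive the killed shell, so that case needs testing.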
Re: Services dropping off
Posted: Tue Jun 22, 2010 10:06 am
by mmestnik
Having check times greater than 2 or 3 seconds may cause the scheduler in Nagios to perpetually fall behind; it could take hours for all the checks to complete, and the checks would then be performed only once every few hours.
That said, there are a number of tunable parameters; I'm not sure which one applies to you. It sounds, though, like you are hitting the timeout(s) [1][2].
1. http://nagios.sourceforge.net/docs/3_0/ ... ck_timeout
2. http://nagios.sourceforge.net/docs/3_0/ ... ck_timeout
These stop Nagios from measuring delays greater than these values. YMMV.
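The point about slow checks making the scheduler fall behind can be illustrated with some rough arithmetic. The Python sketch below is a deliberately simplified model, not how Nagios actually schedules: it assumes every check takes the same time and that a fixed number run at once. All names and numbers are hypothetical.

```python
def cycle_seconds(num_checks, seconds_per_check, concurrency=1):
    """Approximate time to run every check once, with `concurrency` running in parallel."""
    return num_checks * seconds_per_check / concurrency


def keeps_up(num_checks, seconds_per_check, interval_seconds, concurrency=1):
    """True if one full pass over all checks fits inside the check interval."""
    return cycle_seconds(num_checks, seconds_per_check, concurrency) <= interval_seconds
```

For example, 200 checks at 3 seconds each, run one at a time, need 600 seconds per pass, so keeps_up(200, 3, 300) is False for a 5-minute interval, while the same load with four checks in parallel takes 150 seconds and fits.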