[Solved] Nagios and random snmp error
Posted: Mon Apr 14, 2014 7:23 am
by notic
Hi!
I have the following setup, which worked well until I added an additional 250 checks (492 in total).
Configuration: Nagios Core 3.5.0 + Ndo2db, connected to a WAGO 750-880
Last week I was looking at the event log and found a lot of lines like these:
[14-04-2014 13:53:40] SERVICE ALERT: Sys_room1;Fire alarm;OK;SOFT;2;OK :0 state
[14-04-2014 13:52:45] SERVICE ALERT: Sys_room1;Fire alarm;UNKNOWN;SOFT;1;SNMP REQUEST ERROR : No response from remote host '192.168.21.1'.
The strange thing is that if I force the same check right after I get the error, the correct value is returned. I can't see any pattern in these errors.
If anyone could give me some pointers on where to look, I would be very happy.
Thank you.
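One way to look for a pattern is to tally the failures from the Nagios log by service and by hour of day. The following is only a minimal sketch, assuming the log lines have exactly the format quoted above; the function name `tally_snmp_errors` is hypothetical and not part of Nagios.

```python
import re
from collections import Counter

# Pattern assumed from the two sample lines quoted in the post.
ALERT_RE = re.compile(
    r"^\[(?P<date>\d{2}-\d{2}-\d{4}) (?P<hour>\d{2}):\d{2}:\d{2}\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);(?P<type>[^;]+);"
    r"(?P<attempt>\d+);(?P<output>.*)$"
)

def tally_snmp_errors(lines):
    """Count UNKNOWN alerts caused by SNMP request errors, per service and per hour."""
    by_service = Counter()
    by_hour = Counter()
    for line in lines:
        m = ALERT_RE.match(line.strip())
        if not m:
            continue  # not a SERVICE ALERT line in the assumed format
        if m["state"] == "UNKNOWN" and "SNMP REQUEST ERROR" in m["output"]:
            by_service[(m["host"], m["service"])] += 1
            by_hour[m["hour"]] += 1
    return by_service, by_hour

# The two lines from the post, used here as sample input.
sample = [
    "[14-04-2014 13:53:40] SERVICE ALERT: Sys_room1;Fire alarm;OK;SOFT;2;OK :0 state",
    "[14-04-2014 13:52:45] SERVICE ALERT: Sys_room1;Fire alarm;UNKNOWN;SOFT;1;"
    "SNMP REQUEST ERROR : No response from remote host '192.168.21.1'.",
]
by_service, by_hour = tally_snmp_errors(sample)
print(by_service.most_common(5))
print(sorted(by_hour.items()))
```

If the failures cluster in certain hours or on certain services, that points to scheduling or to specific OIDs; an even spread points more towards the device or the network.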
Re: Nagios and random snmp error
Posted: Mon Apr 14, 2014 2:27 pm
by sreinhardt
How often is this happening? What kind of load and disk io do you have on the system when this is happening?
Re: Nagios and random snmp error
Posted: Tue Apr 15, 2014 12:24 am
by notic
hi!
Thank you.
This is a completely random error on random services. Say I have 250 checks against one host (the WAGO): about 10% of them, chosen at random, will fail. It happens throughout the day.
The server is an IBM X-Blade with 4 CPUs, 3 GB RAM, and a 40 GB SCSI hard drive.
iostat
Linux 2.6.32-431.11.2.el6.x86_64 (nagios) 04/15/2014 _x86_64_ (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.82    0.00    1.53    0.37    0.00   92.28

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
vda              59.36        17.86       826.48    1493650   69119066
dm-0            104.23        17.76       826.48    1485154   69119088
dm-1              0.00         0.03         0.00       2376          0
free -m
             total       used       free     shared    buffers     cached
Mem:          2887       1579       1307          0        163        957
-/+ buffers/cache:        457       2429
Swap:         3023          0       3023
top | head -n 2
top - 07:24:04 up 23:16, 1 user, load average: 0.54, 0.58, 0.53
Tasks: 134 total, 4 running, 130 sleeping, 0 stopped, 0 zombie
Re: Nagios and random snmp error
Posted: Tue Apr 15, 2014 10:29 am
by sreinhardt
OK, so we don't have a whole lot of checks, less than 500 is pretty light. You don't have excessive disk, memory, or cpu utilization. What plugin are you using to check these devices?
Re: Nagios and random snmp error
Posted: Wed Apr 16, 2014 5:02 am
by notic
Hi.
The plugins used for this host are check_snmp and check_centreon_snmp_value. Both give me random errors. The funny thing is that the same checks work fine on other hosts.
Re: Nagios and random snmp error
Posted: Wed Apr 16, 2014 3:57 pm
by sreinhardt
Um, so are you using a Nagios-based monitoring solution? In the past when I have seen this, it was usually either a ulimit on max connections or a problem on the path to the remote device, such as the packet getting dropped. It is UDP, after all.
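To tell dropped or unanswered UDP requests apart from a Nagios-side problem, one option is to poll the device outside of Nagios with retries turned off and count how often no answer comes back. The sketch below is only an illustration: it assumes the net-snmp `snmpget` tool is installed and uses its `-r` (retries) and `-t` (timeout) options; the function names, the choice of sysUpTime as the test OID, and the `public` community string in the usage comment are assumptions, not taken from this thread.

```python
import subprocess
import time

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0, used only as a cheap test OID

def probe(host, community, oid, runner=subprocess.run):
    """Send one SNMP GET with no retries; return True if the device answered."""
    # -r 0 disables retries and -t 1 sets a 1 second timeout (net-snmp options).
    cmd = ["snmpget", "-v", "2c", "-c", community, "-r", "0", "-t", "1", host, oid]
    try:
        result = runner(cmd, capture_output=True, text=True, timeout=5)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def loss_rate(attempts, pause=0.0, **probe_args):
    """Fraction of single-shot requests that got no answer."""
    failures = 0
    for _ in range(attempts):
        if not probe(**probe_args):
            failures += 1
        time.sleep(pause)  # optional gap between requests
    return failures / attempts

# Hypothetical usage (host taken from the log line in the first post):
# print(loss_rate(100, pause=0.5, host="192.168.21.1",
#                 community="public", oid=SYS_UPTIME_OID))
```

Running it once with a long pause and once with no pause would show whether the loss depends on how fast the requests arrive.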
Re: Nagios and random snmp error
Posted: Wed Apr 23, 2014 5:51 am
by notic
Hi!
I set up a new configuration with Nagios Core on the same hardware as before. Same number of checks, same errors. This time I used only check_snmp.
[root@nagios ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 22951
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 22951
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Re: Nagios and random snmp error
Posted: Wed Apr 23, 2014 11:17 am
by sreinhardt
Considering it is happening with a new system, and your ulimits seem well within reason, it really would seem to be more of an issue with that device. How long have you been monitoring this device with snmp? How many checks are you running against it at one time, say within a 5 minute window? What kind of device, model and manufacturer is it?
Re: Nagios and random snmp error
Posted: Wed Apr 23, 2014 5:52 pm
by notic
Hi,
The device is brand new; it's a WAGO 750-880 Ethernet controller. Right now I am running 115 SNMP checks against it, but next week I will need to add 90 more.
Right now the checks are in three time groups (important "realtime" values, states, and "constants"): 50 checks in a 3 min window, 30 checks in a 20 min window, and 35 checks in a 70 min window.
The funny thing is that I am running almost 200 checks against each Cisco Catalyst switch without any problems.
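For a rough sense of the request rate this schedule puts on the controller, the three groups can be added up. This is only back-of-the-envelope arithmetic, assuming each window is the check interval for its group and that checks are spread evenly; Nagios may bunch them, so the peak rate can be higher than this average.

```python
# (number of checks, interval in minutes) for the three groups described above
groups = [(50, 3), (30, 20), (35, 70)]

total_checks = sum(checks for checks, _ in groups)
per_minute = sum(checks / minutes for checks, minutes in groups)
per_second = per_minute / 60

print(f"total checks: {total_checks}")
print(f"average rate: {per_minute:.2f} checks/min ({per_second:.2f} checks/s)")
```

Almost all of the load comes from the 3 minute group, so that is the interval to stretch first if the device turns out to be rate-limited.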
Re: Nagios and random snmp error
Posted: Thu Apr 24, 2014 9:53 am
by sreinhardt
Considering your Cisco devices handle it just fine, the other networking devices along the way appear to be in working order. I honestly think you may be hitting a limit on what that device can process at one time. However, looking at the documentation they provide for it, there may be an OID that we can walk to get more information.
snmpwalk -v 2c -c [community string] -t 30 -O n [host\IP] 1.3.6.1.2.1.11
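The OID 1.3.6.1.2.1.11 is the standard MIB-II `snmp` group, which holds the agent's own packet and error counters. Walking it twice, some minutes apart, and comparing the results shows which counters are growing. The sketch below assumes the walk output has the usual net-snmp numeric form (`.1.3.6.1.2.1.11.1.0 = Counter32: 123`); the helper names are hypothetical, only a few well-known counters are named, and the sample values are invented placeholders, not output from the WAGO.

```python
import re

# A few counters from the MIB-II snmp group (1.3.6.1.2.1.11); others keep a numeric label.
NAMES = {
    1: "snmpInPkts",
    2: "snmpOutPkts",
    3: "snmpInBadVersions",
    4: "snmpInBadCommunityNames",
    6: "snmpInASNParseErrs",
    31: "snmpSilentDrops",
}

LINE_RE = re.compile(r"^\.?1\.3\.6\.1\.2\.1\.11\.(\d+)\.0 = Counter32: (\d+)$")

def parse_walk(text):
    """Turn snmpwalk output for the snmp group into {counter name: value}."""
    counters = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            index = int(m.group(1))
            counters[NAMES.get(index, f"snmp.{index}")] = int(m.group(2))
    return counters

def delta(before, after):
    """Counters whose value changed between two walks (counter wrap is ignored)."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }

# Invented placeholder values, for illustration only.
first = ".1.3.6.1.2.1.11.1.0 = Counter32: 1000\n.1.3.6.1.2.1.11.2.0 = Counter32: 990\n"
second = ".1.3.6.1.2.1.11.1.0 = Counter32: 1100\n.1.3.6.1.2.1.11.2.0 = Counter32: 1080\n"
print(delta(parse_walk(first), parse_walk(second)))
```

If snmpInPkts grows faster than snmpOutPkts over the same period, the agent is receiving requests it does not answer, which would fit the "No response from remote host" errors.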