[Solved] Nagios and random snmp error

notic
Posts: 14
Joined: Tue Dec 31, 2013 5:24 am

Post by notic »

Hi!

I have the following setup, which worked well until I added an additional 250 checks (492 in total).

Configuration: Nagios Core 3.5.0 + ndo2db, connected to a WAGO 750-880

Last week I was looking at the event log and found a lot of lines like these:
[14-04-2014 13:53:40] SERVICE ALERT: Sys_room1;Fire alarm;OK;SOFT;2;OK :0 state

[14-04-2014 13:52:45] SERVICE ALERT: Sys_room1;Fire alarm;UNKNOWN;SOFT;1;SNMP REQUEST ERROR : No response from remote host '192.168.21.1'.

The strange thing is that if I force the same check right after I get the error, the correct value is returned. I can't see any pattern in these errors.

If anyone could give me some pointers on where to look, I would be very happy.

Thank you.
Last edited by notic on Tue May 06, 2014 6:31 am, edited 1 time in total.
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Nagios and random snmp error

Post by sreinhardt »

How often is this happening? What kind of load and disk io do you have on the system when this is happening?

Code:

top | head -n 2
free -m
iostat
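
To put a number on "how often", the timeout alerts can be tallied per hour. This is only a rough sketch: it assumes the alerts are available as text in the same format as the event log lines quoted in the first post (the raw nagios.log uses epoch timestamps instead, so it would need converting first), and the function name `timeouts_per_hour` is made up for illustration.

```shell
# timeouts_per_hour: read event log lines on stdin and print one line per
# hour containing at least one SNMP timeout, as "DD-MM-YYYY HH:00 COUNT".
# Assumes the "[DD-MM-YYYY HH:MM:SS] ..." layout quoted earlier in the thread.
timeouts_per_hour() {
    awk '/No response from remote host/ {
        day  = substr($1, 2)      # "[14-04-2014" -> "14-04-2014"
        hour = substr($2, 1, 2)   # "13:52:45]"   -> "13"
        count[day " " hour ":00"]++
    }
    END { for (k in count) print k, count[k] }' | sort
}

# Example with the two lines quoted in the first post.
timeouts_per_hour <<'EOF'
[14-04-2014 13:53:40] SERVICE ALERT: Sys_room1;Fire alarm;OK;SOFT;2;OK :0 state
[14-04-2014 13:52:45] SERVICE ALERT: Sys_room1;Fire alarm;UNKNOWN;SOFT;1;SNMP REQUEST ERROR : No response from remote host '192.168.21.1'.
EOF
```

If the counts cluster around particular hours, that may point at scheduling or load; an even spread would fit better with random packet loss or a busy device.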
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
notic
Posts: 14
Joined: Tue Dec 31, 2013 5:24 am

Re: Nagios and random snmp error

Post by notic »

hi!

Thank you.
This is a completely random error on random services. Let's say I have 250 checks against one host (the WAGO): a random 10% of them will fail. It fails throughout the day.

The hardware is an IBM X-Blade with 4 processors, 3 GB RAM, and a 40 GB SCSI hard drive.

iostat

Code:

Linux 2.6.32-431.11.2.el6.x86_64 (nagios)      04/15/2014      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.82    0.00    1.53    0.37    0.00   92.28

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
vda              59.36        17.86       826.48    1493650   69119066
dm-0            104.23        17.76       826.48    1485154   69119088
dm-1              0.00         0.03         0.00       2376          0

free -m

Code:

             total       used       free     shared    buffers     cached
Mem:          2887       1579       1307          0        163        957
-/+ buffers/cache:        457       2429
Swap:         3023          0       3023

top | head -n 2

Code:

top - 07:24:04 up 23:16,  1 user,  load average: 0.54, 0.58, 0.53
Tasks: 134 total,   4 running, 130 sleeping,   0 stopped,   0 zombie
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Nagios and random snmp error

Post by sreinhardt »

OK, so we don't have a whole lot of checks, less than 500 is pretty light. You don't have excessive disk, memory, or cpu utilization. What plugin are you using to check these devices?
notic
Posts: 14
Joined: Tue Dec 31, 2013 5:24 am

Re: Nagios and random snmp error

Post by notic »

Hi.

The plugins used for this host are check_snmp and check_centreon_snmp_value. Both give me random errors. The funny thing is that the same checks work fine on other hosts.
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Nagios and random snmp error

Post by sreinhardt »

Um, so are you using a Nagios-based monitoring solution? Based on past times I have seen this, it is usually either a ulimit for max connections, or an issue on the way to the remote device, like the packet getting dropped. It is UDP, after all.

Code:

ulimit -a
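
To check the dropped-packet theory separately from Nagios, one option is to send the same request many times by hand and count how many get no answer. The helper below is a hypothetical sketch (the name `count_failures` is not from any tool); it simply runs whatever command it is given and counts the non-zero exits.

```shell
# count_failures N CMD [ARGS...]
# Run CMD N times, discard its output, and print how many runs failed.
count_failures() {
    n=$1
    shift
    fail=0
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 || fail=$((fail + 1))
        i=$((i + 1))
    done
    echo "$fail"
}

# Example invocation (not run here). The community string "public" and the
# sysUpTime OID are placeholders; "-r 0" disables retries so that every
# lost request shows up as a failure instead of being retried silently.
# count_failures 100 snmpget -v 2c -c public -t 1 -r 0 192.168.21.1 1.3.6.1.2.1.1.3.0
```

A result near zero would suggest the network path is fine for single requests, which would shift suspicion toward how many requests reach the device at the same time.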
notic
Posts: 14
Joined: Tue Dec 31, 2013 5:24 am

Re: Nagios and random snmp error

Post by notic »

Hi!

I set up a new installation of Nagios Core on the same hardware configuration as before. Same number of checks, same errors. This time I tried using only check_snmp.

[root@nagios ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 22951
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 22951
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Nagios and random snmp error

Post by sreinhardt »

Considering it is happening with a new system, and your ulimits seem well within reason, it really would seem to be more of an issue with that device. How long have you been monitoring this device with snmp? How many checks are you running against it at one time, say within a 5 minute window? What kind of device, model and manufacturer is it?
notic
Posts: 14
Joined: Tue Dec 31, 2013 5:24 am

Re: Nagios and random snmp error

Post by notic »

Hi,

The device is brand new; it's a WAGO 750-880 Ethernet controller. Right now I am running 115 SNMP checks against it, but next week I will need to add 90 more.
The checks are currently split into 3 time groups (important "realtime" values, states, and "constants"): 50 checks in a 3-minute window, 30 checks in a 20-minute window, and 35 checks in a 70-minute window.

The funny thing is that I am running almost 200 checks against each Cisco Catalyst switch without any problems.
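
For reference, the three groups above can be turned into an average request rate. This is a back-of-the-envelope sketch that assumes each group's checks run once per window and are spread evenly across it; the function name `checks_per_minute` is made up for illustration.

```shell
# checks_per_minute COUNT:MINUTES [COUNT:MINUTES ...]
# Sum COUNT/MINUTES over all groups and print the average checks per minute.
checks_per_minute() {
    printf '%s\n' "$@" |
        awk -F: '{ rate += $1 / $2 } END { printf "%.1f\n", rate }'
}

# 50 checks per 3 min, 30 per 20 min, 35 per 70 min
checks_per_minute 50:3 30:20 35:70   # prints 18.7
```

Under those assumptions that is about 93 checks in any 5-minute window. It is only an average: if Nagios happens to schedule many of them in the same second, the momentary rate seen by the controller could be much higher.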
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Nagios and random snmp error

Post by sreinhardt »

Considering your Cisco devices handle it just fine, we can assume the other networking devices are in working order. I honestly think you may be hitting a limit on what that device can process at one time. However, looking at the documentation they provide for it, there may be an OID that we can walk to get more information.

Code:

snmpwalk -v 2c -c [community string] -t 30 -O n [host/IP] 1.3.6.1.2.1.11
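
That subtree is the standard MIB-II SNMP group, where 1.3.6.1.2.1.11.1.0 is snmpInPkts and 1.3.6.1.2.1.11.2.0 is snmpOutPkts. If the walk is saved to a file twice, some time apart, the counters can be compared. The sketch below assumes the usual net-snmp `-O n` output layout (`OID = Counter32: value`) and that the controller actually exposes these counters; the function name and file names are hypothetical.

```shell
# snmp_counter_delta BEFORE_FILE AFTER_FILE
# For every Counter32 OID present in both saved walks, print "OID delta".
snmp_counter_delta() {
    awk '
        $3 == "Counter32:" {
            if (FILENAME == ARGV[1]) {
                before[$1] = $4            # remember the first snapshot
            } else if ($1 in before) {
                print $1, $4 - before[$1]  # growth since the first snapshot
            }
        }
    ' "$1" "$2"
}

# Example invocation (not run here), with before.txt and after.txt produced
# by redirecting the snmpwalk command above into a file at two points in time:
# snmp_counter_delta before.txt after.txt
```

If snmpInPkts grows noticeably faster than snmpOutPkts, that may hint that the agent receives requests it never answers. If both grow by less than the number of requests Nagios sent, the packets are more likely being lost before they reach the agent.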