Nagios Support Forum

Posted: **Mon Apr 14, 2014 7:23 am**

Hi!

I have the following setup, which worked well until I added an additional 250 checks (492 in total).

Configuration: Nagios Core 3.5.0 + Ndo2db--->connected to WAGO 750-880

Last week I was looking at the event log and found a lot of lines like these:
[14-04-2014 13:53:40] SERVICE ALERT: Sys_room1;Fire alarm;OK;SOFT;2;OK :0 state

[14-04-2014 13:52:45] SERVICE ALERT: Sys_room1;Fire alarm;UNKNOWN;SOFT;1;SNMP REQUEST ERROR : No response from remote host '192.168.21.1'.

The strange thing is that if I force the same check right after I get the error, it returns the correct value. I can't see any pattern in these errors.

If anyone could give me some pointers on where to look, I would be very happy.

Thank you.

Posted: **Mon Apr 14, 2014 2:27 pm**

How often is this happening? What kind of load and disk io do you have on the system when this is happening?

Code:

top | head -n 2
free -m
iostat

Posted: **Tue Apr 15, 2014 12:24 am**

hi!

Thank you.
This is a completely random error on random services. Say I have 250 checks against one host (the WAGO): about 10% of them, picked at random, will fail. It happens throughout the day.

The hardware is an IBM X-Blade with 4 processors, 3 GB RAM, and a 40 GB SCSI hard drive.

iostat

Code:

Linux 2.6.32-431.11.2.el6.x86_64 (nagios)      04/15/2014      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.82    0.00    1.53    0.37    0.00   92.28

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
vda              59.36        17.86       826.48    1493650   69119066
dm-0            104.23        17.76       826.48    1485154   69119088
dm-1              0.00         0.03         0.00       2376          0

free -m

Code:

             total       used       free     shared    buffers     cached
Mem:          2887       1579       1307          0        163        957
-/+ buffers/cache:        457       2429
Swap:         3023          0       3023

top | head -n 2

Code:

top - 07:24:04 up 23:16,  1 user,  load average: 0.54, 0.58, 0.53
Tasks: 134 total,   4 running, 130 sleeping,   0 stopped,   0 zombie

Posted: **Tue Apr 15, 2014 10:29 am**

OK, so that is not a whole lot of checks; fewer than 500 is pretty light. You don't have excessive disk, memory, or CPU utilization. What plugin are you using to check these devices?

Posted: **Wed Apr 16, 2014 5:02 am**

Hi.

The plugins used for this host are check_snmp and check_centreon_snmp_value. Both checks give me random errors. The funny thing is that the same checks work fine on other hosts.

Posted: **Wed Apr 16, 2014 3:57 pm**

Um, so are you using a Nagios-based monitoring solution? In the past when I have seen this, it has usually been either a ulimit on max connections or a problem on the way to the remote device, like the packet getting dropped. It is UDP, after all.

Code:

ulimit -a
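One way to put a number on the dropped-packet theory is to repeat a single request many times with retries turned off and count the failures. The helper below is only a sketch: `probe_loss` is a made-up name, and it judges success purely by the command's exit status. The commented usage line assumes Net-SNMP's `snmpget` (`-r 0` disables retries, `-t 1` sets a 1 second timeout) and uses the host address from the log above with sysUpTime.0 as a harmless OID to poll.

```shell
# probe_loss COUNT CMD [ARGS...]
# Hypothetical helper: runs CMD COUNT times and prints "failed/total",
# counting a run as failed when CMD exits non-zero.
probe_loss() {
    total=$1
    shift
    failed=0
    i=0
    while [ "$i" -lt "$total" ]; do
        # Output is discarded; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failed=$((failed + 1))
        i=$((i + 1))
    done
    echo "$failed/$total"
}

# Example (not run here; community string is a placeholder):
# probe_loss 100 snmpget -v 2c -c [community string] -r 0 -t 1 192.168.21.1 1.3.6.1.2.1.1.3.0
```

If the failure ratio from the command line is close to what Nagios reports, the problem is probably not specific to Nagios or its plugins.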

Posted: **Wed Apr 23, 2014 5:51 am**

Hi!

I set up a new installation of Nagios Core on the same hardware configuration as before. Same number of checks, same errors. This time I tried using only check_snmp.

[root@nagios ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 22951
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 22951
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Posted: **Wed Apr 23, 2014 11:17 am**

Considering it is happening with a new system, and your ulimits seem well within reason, it really does seem to be more of an issue with that device. How long have you been monitoring this device with SNMP? How many checks are you running against it at one time, say within a 5 minute window? What kind of device is it (model and manufacturer)?

Posted: **Wed Apr 23, 2014 5:52 pm**

Hi,

The device is brand new; it's a WAGO 750-880 Ethernet controller. Right now I am running 115 SNMP checks against it, but next week I will need to add 90 new checks.
The checks are currently split into 3 time groups (important "realtime" values, states, and "constants"): 50 checks in a 3 min window, 30 checks in a 20 min window, and 35 checks in a 70 min window.

The funny thing is that I am running almost 200 checks against each Cisco Catalyst switch without any problems.
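As a back-of-the-envelope figure, the three groups above can be turned into an average request rate. This sketch assumes each window is the check interval for its group and that checks are spread evenly across it, which Nagios does not guarantee; `checks_per_minute` is an illustrative name.

```shell
# checks_per_minute: read "COUNT INTERVAL_MINUTES" pairs on stdin and
# print the average number of checks per minute across all groups.
checks_per_minute() {
    awk 'NF >= 2 && $2 > 0 { rate += $1 / $2 } END { printf "%.2f\n", rate }'
}

# Figures from the post: 50 checks / 3 min, 30 / 20 min, 35 / 70 min.
printf '50 3\n30 20\n35 70\n' | checks_per_minute   # prints 18.67
```

That averages out to roughly one request every 3 seconds, so any trouble on the device side would more likely come from bursts of checks landing together than from the sustained rate.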

Posted: **Thu Apr 24, 2014 9:53 am**

Considering your Cisco devices handle it just fine, the rest of the network looks to be in working order. I honestly think you may be hitting a limit on what that device can process at one time. However, looking at the documentation they provide for it, there may be an OID that we can walk to get more information.

Code:

snmpwalk -v 2c -c [community string] -t 30 -O n [host\IP] 1.3.6.1.2.1.11
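The subtree 1.3.6.1.2.1.11 is the standard MIB-II SNMP group, whose first two counters are snmpInPkts (.1) and snmpOutPkts (.2). A small filter like the one below can pull those two values out of the walk so that snapshots taken a few minutes apart are easy to compare. It is a sketch only: `snmp_pkt_summary` is a made-up name, and it assumes the usual `-O n` line format of `.OID = Counter32: value`.

```shell
# snmp_pkt_summary: read `snmpwalk -O n` output for 1.3.6.1.2.1.11 on stdin
# and print the snmpInPkts and snmpOutPkts counters on one line.
snmp_pkt_summary() {
    awk '
        $1 == ".1.3.6.1.2.1.11.1.0" { in_pkts  = $NF }   # snmpInPkts
        $1 == ".1.3.6.1.2.1.11.2.0" { out_pkts = $NF }   # snmpOutPkts
        END { printf "in=%s out=%s\n", in_pkts, out_pkts }
    '
}

# Example (not run here), reusing the walk command above:
# snmpwalk -v 2c -c [community string] -t 30 -O n [host\IP] 1.3.6.1.2.1.11 | snmp_pkt_summary
```

If the "in" counter climbs noticeably faster than the "out" counter between two snapshots, that may hint that the agent is receiving requests it never answers; if "in" itself grows by less than the number of requests sent, the packets are more likely being lost before they reach the device.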

[Solved] Nagios and random snmp error