Soft state counter decreasing?!?!?

Posted: Tue Oct 27, 2015 1:25 pm
by zzt
Hello, the other day we had a switch go down, and Nagios got a little weird. The ping failure put the host into a soft state, and the attempt counter started incrementing towards our defined max of 5. At some point the counter went from 3 to 2, then back to 3. It eventually got to 5 and alerted, but I don't know why it would decrement in the first place. Has anyone else ever seen this behavior? We're running a basically stock setup of Nagios 3.5.1.

Syslog:
Oct 26 19:21:59 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Oct 26 19:22:49 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 19:25:29 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
Oct 26 19:27:03 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 19:29:23 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
Oct 26 19:32:03 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;4;PING CRITICAL - Packet loss = 100%
Oct 26 19:34:43 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;HARD;5;PING CRITICAL - Packet loss = 100%

An alert was successfully sent at this point, but this behavior is very worrisome. I don't want to end up in a situation where the counter just bounces between 2 and 3.
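
For anyone wanting to check their own logs for the same pattern, here is a minimal Python sketch that scans HOST ALERT lines and reports any place where the attempt number fails to increase while the host stays in the same state. The regular expression is an assumption based only on the line format quoted above, and the function and variable names are made up for illustration.

```python
import re

# Assumed format, taken from the syslog lines quoted in this post:
# "... HOST ALERT: <host>;<state>;<SOFT|HARD>;<attempt>;<plugin output>"
ALERT_RE = re.compile(r"HOST ALERT: ([^;]+);([^;]+);(SOFT|HARD);(\d+);")


def find_counter_regressions(lines):
    """Return (host, previous_attempt, attempt, line) tuples for every alert
    whose attempt number is not higher than the previous one for that host
    while the state (e.g. DOWN) is unchanged."""
    last_seen = {}  # host -> (state, attempt)
    regressions = []
    for line in lines:
        match = ALERT_RE.search(line)
        if match is None:
            continue  # not a HOST ALERT line
        host, state = match.group(1), match.group(2)
        attempt = int(match.group(4))
        previous = last_seen.get(host)
        if previous and previous[0] == state and attempt <= previous[1]:
            regressions.append((host, previous[1], attempt, line.strip()))
        last_seen[host] = (state, attempt)
    return regressions


SAMPLE_LOG = """\
Oct 26 19:21:59 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Oct 26 19:22:49 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 19:25:29 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
Oct 26 19:27:03 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 19:29:23 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
Oct 26 19:32:03 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;4;PING CRITICAL - Packet loss = 100%
Oct 26 19:34:43 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;HARD;5;PING CRITICAL - Packet loss = 100%
""".splitlines()

for host, prev, cur, _line in find_counter_regressions(SAMPLE_LOG):
    print(f"{host}: attempt went from {prev} to {cur}")
```

Run against the lines above, this flags the single step from attempt 3 to attempt 2. It only detects the symptom and says nothing about the cause.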

Re: Soft state counter decreasing?!?!?

Posted: Tue Oct 27, 2015 1:31 pm
by zzt
Just saw that there was a similar issue with an iDRAC. This one stalled at 2 before moving on to the max retry:

Oct 26 22:02:09 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Oct 26 22:04:39 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 22:08:04 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 22:10:14 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
Oct 26 22:12:34 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;4;PING CRITICAL - Packet loss = 100%
Oct 26 22:14:54 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;HARD;5;PING CRITICAL - Packet loss = 100%

Re: Soft state counter decreasing?!?!?

Posted: Wed Oct 28, 2015 1:50 am
by Box293
What is the output of these commands:

ps -ef | grep nagios.cfg
ipcs -q

Re: Soft state counter decreasing?!?!?

Posted: Wed Oct 28, 2015 11:11 am
by zzt
ps shows the main nagios3 process (45336) and a large number of subprocesses, which in turn have subprocesses doing the actual checks. There's only one parent nagios3 process, so I've been under the assumption that this is the correct behavior. ipcs returns nothing from what I can tell.

$ ps -ef | grep nagios.cfg

nagios    40521  45336  0 15:57 ?        00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios    40927  45336  0 15:57 ?        00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios    45336      1 82 08:00 ?        06:35:23 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios    45458  45336  0 15:57 ?        00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios    63240  45336  0 15:57 ?        00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios    68877  45336  0 15:57 ?        00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
...etc

$ ps -auxf
...

nagios    45336 82.7  0.2 3115628 137196 ?      RNsl 08:00 399:41 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios   123579  1.1  0.2 3115628 135080 ?      SN   16:03   0:00  \_ /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios   123643  0.2  0.0   4480   760 ?        SN   16:03   0:00  |   \_ sh -c /usr/lib/nagios/plugins/check_ping -H '192.168.0.44' -w 5000,100% -c 5000,100% -p 1
nagios   123659  0.0  0.0   8396   944 ?        SN   16:03   0:00  |       \_ /usr/lib/nagios/plugins/check_ping -H 192.168.0.44 -w 5000,100% -c 5000,100% -p 1
nagios   123660  0.0  0.0   4404   744 ?        SN   16:03   0:00  |           \_ /bin/ping -n -U -w 10 -c 1 192.168.0.44
nagios   123582  0.7  0.2 3115628 135080 ?      SN   16:03   0:00  \_ /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios   123633  0.4  0.0   4480   844 ?        SN   16:03   0:00  |   \_ sh -c /usr/lib/nagios/plugins/check_ping -H '192.168.0.10' -w 5000,100% -c 5000,100% -p 1
nagios   123667  0.0  0.0   8396  1016 ?        SN   16:03   0:00  |       \_ /usr/lib/nagios/plugins/check_ping -H 192.168.0.10 -w 5000,100% -c 5000,100% -p 1
nagios   123668  0.0  0.0   4404   800 ?        SN   16:03   0:00  |           \_ /bin/ping -n -U -w 10 -c 1 192.168.0.10
...etc

$ ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

Re: Soft state counter decreasing?!?!?

Posted: Wed Oct 28, 2015 5:20 pm
by tmcdonald
Did you configure NDO (ndoutils, ndo2db, etc) on this machine by any chance?

Re: Soft state counter decreasing?!?!?

Posted: Wed Oct 28, 2015 5:20 pm
by Box293
I suspected multiple Nagios processes were the cause of the problem, but as we can see there is only one parent nagios process.
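
As a rough way to repeat that check, the Python sketch below parses `ps -ef` output and lists the Nagios processes whose parent PID is 1, i.e. the top-level daemons. It assumes the standard `ps -ef` column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD) and that the daemon was reparented to init (PID 1), which may not hold on every system. The function name is made up for illustration.

```python
def find_daemon_parents(ps_output, needle="nagios.cfg"):
    """Return the PIDs of processes whose command line contains `needle`
    and whose parent PID is 1 (assumed to be init)."""
    parents = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the command line stays in one piece.
        fields = line.split(None, 7)
        if len(fields) < 8 or needle not in fields[7]:
            continue
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header or malformed line
        pid, ppid = int(fields[1]), int(fields[2])
        if ppid == 1:
            parents.append(pid)
    return parents


# Lines taken from the ps -ef output posted earlier in this thread.
SAMPLE_PS = """\
nagios    40521  45336  0 15:57 ?        00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios    40927  45336  0 15:57 ?        00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios    45336      1 82 08:00 ?        06:35:23 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios    45458  45336  0 15:57 ?        00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
"""

parents = find_daemon_parents(SAMPLE_PS)
print(f"{len(parents)} top-level nagios process(es): {parents}")
```

More than one PID in the result would suggest two daemons running against the same config. With the output posted above there is only 45336.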

ipcs relates to ndoutils, which it appears you're not using.

I see you're on nagios3. I suggest upgrading to 4.1.1, as there have been many updates since then.

Are you using a load balancer like Mod_Gearman?

Re: Soft state counter decreasing?!?!?

Posted: Wed Oct 28, 2015 5:23 pm
by hsmith
Also, what was the installation method for this? Source/repo? OS? Would be nice to have this information handy for investigation.

Re: Soft state counter decreasing?!?!?

Posted: Thu Oct 29, 2015 4:23 pm
by zzt
It's running on Ubuntu 12.04.5 LTS, and installed using apt-get off our local mirror. There's no sign of NDO or Gearman either. The load average on this thing is pretty high though. The CPUs (2 proc, 24 cores) aren't that busy, but there are a ton of processes. I guess that's just Nagios being Nagios:

top - 21:21:38 up 36 days, 20:28,  1 user,  load average: 33.02, 29.96, 28.02
Tasks: 795 total,  13 running, 782 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.3%us, 26.1%sy, 23.8%ni, 41.9%id,  3.5%wa,  0.0%hi,  0.5%si,  0.0%st
Mem:  49411656k total, 14925488k used, 34486168k free,   482744k buffers
Swap:  4194300k total,        0k used,  4194300k free,  4807712k cached

 21:21:55 up 36 days, 20:28,  1 user,  load average: 30.57, 29.59, 27.94
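
To put those load numbers in context, here is a small Python sketch that divides the three load averages from an `uptime` line by the core count. The core count of 24 is an assumption read off the "2 proc, 24 cores" note above (it could also mean 24 cores per socket), and the function name is made up for illustration.

```python
import re

CORES_ASSUMED = 24  # assumption: 24 cores in total across both sockets


def load_per_core(uptime_line, cores):
    """Return the 1, 5 and 15 minute load averages divided by `cores`."""
    match = re.search(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)", uptime_line)
    if match is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) / cores for value in match.groups())


UPTIME_LINE = " 21:21:55 up 36 days, 20:28,  1 user,  load average: 30.57, 29.59, 27.94"

one, five, fifteen = load_per_core(UPTIME_LINE, CORES_ASSUMED)
print(f"load per core: {one:.2f} (1m), {five:.2f} (5m), {fifteen:.2f} (15m)")
```

A value above 1.0 means that, on average, more tasks are runnable or waiting than there are cores, under the core-count assumption stated above.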

Re: Soft state counter decreasing?!?!?

Posted: Thu Oct 29, 2015 6:19 pm
by Box293
zzt wrote: It's running on Ubuntu 12.04.5 LTS, and installed using apt-get off our local mirror.

This is most likely core 3.x. There were some major performance improvements in 4.x and I recommend you upgrade to 4.1.1.

I do have some guides here for installing from Source on Ubuntu:
http://sites.box293.com/nagios/guides/i ... untu-14-04

However, that guide is for version 4.0.8. I don't have a guide for 4.1.1 on Ubuntu at the moment, but the relevant URLs are in this CentOS guide:
http://sites.box293.com/nagios/guides/i ... centos-6-7

You would need to take a copy of your configs first and then uninstall the old version before installing from source.

You could look at installing Mod_Gearman locally which may help.

A RAM disk would also greatly improve performance. This guide is for Nagios XI on CentOS/RHEL, but the concepts are identical.
https://assets.nagios.com/downloads/nag ... giosXI.pdf