Soft state counter decreasing?!?!?
Posted: Tue Oct 27, 2015 1:25 pm
by zzt
Hello. The other day a switch went down and Nagios behaved oddly. The ping failure put the host into a soft state, and the attempt counter started incrementing toward our configured maximum of 5. At some point the counter went from 3 to 2, then back to 3. It eventually reached 5 and alerted, but I don't know why it would decrement in the first place. Has anyone else seen this behavior? We're running a basically stock setup of Nagios 3.5.1.
Syslog:
Oct 26 19:21:59 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Oct 26 19:22:49 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 19:25:29 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
Oct 26 19:27:03 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 19:29:23 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
Oct 26 19:32:03 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;SOFT;4;PING CRITICAL - Packet loss = 100%
Oct 26 19:34:43 nagios01 nagios3: HOST ALERT: r2r10tor2;DOWN;HARD;5;PING CRITICAL - Packet loss = 100%
An alert was successfully sent at this point, but this behavior is very worrisome. I don't want to get into a situation where it just flaps between 2 and 3.
Re: Soft state counter decreasing?!?!?
Posted: Tue Oct 27, 2015 1:31 pm
by zzt
Just saw that there was a similar issue with an iDRAC. This one stalled at 2 before moving on to the max retry:
Oct 26 22:02:09 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Oct 26 22:04:39 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 22:08:04 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Oct 26 22:10:14 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;3;PING CRITICAL - Packet loss = 100%
Oct 26 22:12:34 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;SOFT;4;PING CRITICAL - Packet loss = 100%
Oct 26 22:14:54 nagios01 nagios3: HOST ALERT: server213-drac;DOWN;HARD;5;PING CRITICAL - Packet loss = 100%
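To find other occurrences without reading syslog by eye, a small script can flag every HOST ALERT line whose attempt number fails to go up. This is only a sketch: the function name `find_attempt_anomalies` and the regular expression are my own, and they assume the log lines have exactly the `host;state;state_type;attempt;output` layout shown above.

```python
import re

# Assumed layout, taken from the syslog lines quoted in this thread:
# HOST ALERT: host;state;state_type;attempt;plugin output
ALERT_RE = re.compile(
    r"HOST ALERT: (?P<host>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);"
)


def find_attempt_anomalies(lines):
    """Return (host, previous_attempt, current_attempt, line) tuples for each
    alert whose attempt number did not increase compared with the previous
    alert for the same host, while the host stayed in the same state."""
    last = {}  # host -> (state, attempt) from the previous alert
    anomalies = []
    for line in lines:
        m = ALERT_RE.search(line)
        if m is None:
            continue  # not a HOST ALERT line
        host, state = m["host"], m["state"]
        attempt = int(m["attempt"])
        prev = last.get(host)
        # A state change (e.g. DOWN -> UP) legitimately resets the counter,
        # so only compare attempts within the same state.
        if prev is not None and prev[0] == state and attempt <= prev[1]:
            anomalies.append((host, prev[1], attempt, line.strip()))
        last[host] = (state, attempt)
    return anomalies
```

Fed the two excerpts above, it should report the 3 to 2 drop for r2r10tor2 and the repeated 2 for server213-drac. It only detects the symptom; it says nothing about the cause.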
Re: Soft state counter decreasing?!?!?
Posted: Wed Oct 28, 2015 1:50 am
by Box293
What is the output of these commands:
Re: Soft state counter decreasing?!?!?
Posted: Wed Oct 28, 2015 11:11 am
by zzt
ps shows the main nagios3 process (45336) and a large number of subprocesses, which in turn have subprocesses doing the actual checks. There's only one parent nagios3 process, so I've been assuming this is the correct behavior. As far as I can tell, ipcs returns nothing.
$ ps -ef | grep nagios.cfg
nagios 40521 45336 0 15:57 ? 00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios 40927 45336 0 15:57 ? 00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios 45336 1 82 08:00 ? 06:35:23 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios 45458 45336 0 15:57 ? 00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios 63240 45336 0 15:57 ? 00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios 68877 45336 0 15:57 ? 00:00:00 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
...etc
$ ps -auxf
...
nagios 45336 82.7 0.2 3115628 137196 ? RNsl 08:00 399:41 /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios 123579 1.1 0.2 3115628 135080 ? SN 16:03 0:00 \_ /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios 123643 0.2 0.0 4480 760 ? SN 16:03 0:00 | \_ sh -c /usr/lib/nagios/plugins/check_ping -H '192.168.0.44' -w 5000,100% -c 5000,100% -p 1
nagios 123659 0.0 0.0 8396 944 ? SN 16:03 0:00 | \_ /usr/lib/nagios/plugins/check_ping -H 192.168.0.44 -w 5000,100% -c 5000,100% -p 1
nagios 123660 0.0 0.0 4404 744 ? SN 16:03 0:00 | \_ /bin/ping -n -U -w 10 -c 1 192.168.0.44
nagios 123582 0.7 0.2 3115628 135080 ? SN 16:03 0:00 \_ /usr/sbin/nagios3 -d /etc/nagios3/nagios.cfg
nagios 123633 0.4 0.0 4480 844 ? SN 16:03 0:00 | \_ sh -c /usr/lib/nagios/plugins/check_ping -H '192.168.0.10' -w 5000,100% -c 5000,100% -p 1
nagios 123667 0.0 0.0 8396 1016 ? SN 16:03 0:00 | \_ /usr/lib/nagios/plugins/check_ping -H 192.168.0.10 -w 5000,100% -c 5000,100% -p 1
nagios 123668 0.0 0.0 4404 800 ? SN 16:03 0:00 | \_ /bin/ping -n -U -w 10 -c 1 192.168.0.10
...etc
$ ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
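The "only one parent" check on the `ps -ef` output can also be done mechanically. The sketch below is hypothetical (the helper name `find_daemon_parents` is mine) and assumes the standard `ps -ef` column order of UID, PID, PPID, C, STIME, TTY, TIME, CMD, and that a daemonized Nagios has PID 1 as its parent, as in the listing above.

```python
def find_daemon_parents(ps_lines, needle="nagios.cfg"):
    """Return the PIDs of matching processes whose parent PID is 1.

    ps_lines: lines of `ps -ef` output (assumed column order:
    UID PID PPID C STIME TTY TIME CMD).
    needle: substring the command line must contain.
    """
    parents = []
    for line in ps_lines:
        fields = line.split(None, 7)  # keep the full command as one field
        if len(fields) < 8 or needle not in fields[7]:
            continue
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header line or unexpected format
        if int(fields[2]) == 1:
            parents.append(int(fields[1]))
    return parents
```

More than one PID in the result would mean more than one daemon is running against the same config. For the listing above it should return only 45336.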
Re: Soft state counter decreasing?!?!?
Posted: Wed Oct 28, 2015 5:20 pm
by tmcdonald
Did you configure NDO (ndoutils, ndo2db, etc) on this machine by any chance?
Re: Soft state counter decreasing?!?!?
Posted: Wed Oct 28, 2015 5:20 pm
by Box293
I suspected that multiple Nagios processes were causing the problem, but as we can see there is only one parent nagios process.
ipcs relates to ndoutils, which it appears you're not using.
I see you're on nagios3. I suggest upgrading to 4.1.1, as there have been many updates since then.
Are you using a load balancer like Mod_Gearman?
Re: Soft state counter decreasing?!?!?
Posted: Wed Oct 28, 2015 5:23 pm
by hsmith
Also, what was the installation method for this? Source/repo? OS? Would be nice to have this information handy for investigation.
Re: Soft state counter decreasing?!?!?
Posted: Thu Oct 29, 2015 4:23 pm
by zzt
It's running on Ubuntu 12.04.5 LTS, and was installed using apt-get from our local mirror. There's no sign of NDO or Gearman either. The load average on this thing is pretty high, though. The CPUs (2 proc, 24 cores) aren't that busy, but there are a ton of processes. I guess that's just nagios being nagios:
top - 21:21:38 up 36 days, 20:28, 1 user, load average: 33.02, 29.96, 28.02
Tasks: 795 total, 13 running, 782 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.3%us, 26.1%sy, 23.8%ni, 41.9%id, 3.5%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 49411656k total, 14925488k used, 34486168k free, 482744k buffers
Swap: 4194300k total, 0k used, 4194300k free, 4807712k cached
21:21:55 up 36 days, 20:28, 1 user, load average: 30.57, 29.59, 27.94
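To put those load averages in proportion to the core count, they can be divided by the number of cores. This is a rough sketch: the helper name `load_per_core` is my own, and the figure of 24 cores assumes "2 proc, 24 cores" means 24 cores in total.

```python
import re

# Matches the "load average: a, b, c" tail printed by top and uptime.
LOAD_RE = re.compile(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)")


def load_per_core(uptime_line, cores):
    """Return the 1, 5 and 15 minute load averages divided by `cores`."""
    m = LOAD_RE.search(uptime_line)
    if m is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) / cores for value in m.groups())


# Assumption: 24 cores in total.
line = "top - 21:21:38 up 36 days, 20:28, 1 user, load average: 33.02, 29.96, 28.02"
print(load_per_core(line, cores=24))
```

A value above 1.0 per core means more runnable (or uninterruptibly waiting) tasks than cores on average, which fits a box that forks a lot of short-lived check processes.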
Re: Soft state counter decreasing?!?!?
Posted: Thu Oct 29, 2015 6:19 pm
by Box293
zzt wrote:It's running on Ubuntu 12.04.5 LTS, and installed using apt-get off our local mirror.
This is most likely core 3.x. There were some major performance improvements in 4.x and I recommend you upgrade to 4.1.1.
I do have some guides here for installing from Source on Ubuntu:
http://sites.box293.com/nagios/guides/i ... untu-14-04
However, that guide covers version 4.0.8. I don't have a guide for 4.1.1 on Ubuntu at the moment, but the relevant URLs are in this CentOS guide:
http://sites.box293.com/nagios/guides/i ... centos-6-7
You would need to take a copy of your configs first and then uninstall the old version before installing from source.
You could look at installing Mod_Gearman locally which may help.
A RAM disk would also greatly improve performance. This guide is for Nagios XI on CentOS/RHEL; however, the concepts are identical.
https://assets.nagios.com/downloads/nag ... giosXI.pdf