Nagios IPCS stops processing

Posted: Wed Apr 08, 2015 4:30 pm
by rseiwert
I'm still having an issue where XI stops updating while Core keeps functioning. The message queues keep growing, so I'm wondering: which process is supposed to be processing these messages?

[root@nagios var]# ipcs -q
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x67000002 0          nagios     600        41368576     40399

Re: Nagios IPCS stops processing

Posted: Wed Apr 08, 2015 4:33 pm
by abrist
Usually ndo2db. I have been trying to figure out a good way to view what is in those messages in the queue. If you know of a good way to do so, do tell.
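
This does not show the message contents, but on Linux the kernel exposes per-queue metadata in /proc/sysvipc/msg, including the PID of the last process to receive from each queue and the time of that receive. A minimal sketch, assuming that file is present and that its header line names the columns msqid, cbytes, qnum, lrpid and rtime; the function names are illustrative only:

```python
import time


def parse_sysvipc_msg(text):
    """Parse the contents of /proc/sysvipc/msg into a list of dicts.

    Column names are taken from the file's own header line, not hard-coded.
    """
    lines = text.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        values = line.split()
        if len(values) != len(header):
            continue  # skip anything that does not match the header layout
        row = dict(zip(header, values))
        # "perms" is printed in octal, so keep it as text; the rest are integers.
        rows.append({k: (v if k == "perms" else int(v)) for k, v in row.items()})
    return rows


def describe_queue(row, now=None):
    """Summarise one queue: depth, last receiver PID, seconds since last receive."""
    now = time.time() if now is None else now
    # rtime is 0 when nothing has ever been received from the queue.
    idle = None if row["rtime"] == 0 else int(now - row["rtime"])
    return {
        "msqid": row["msqid"],
        "messages": row["qnum"],
        "bytes": row["cbytes"],
        "last_receiver_pid": row["lrpid"],
        "seconds_since_last_receive": idle,
    }


def read_local_queues(path="/proc/sysvipc/msg"):
    """Read and summarise every message queue on the local host."""
    with open(path) as fh:
        return [describe_queue(r) for r in parse_sysvipc_msg(fh.read())]
```

Calling read_local_queues() on the Nagios host would list each queue. A queue whose message count keeps climbing while seconds_since_last_receive also grows suggests a stalled reader, and last_receiver_pid can be compared against the ndo2db PIDs shown by ps.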

Re: Nagios IPCS stops processing

Posted: Wed Apr 08, 2015 5:13 pm
by rseiwert
It indeed is something with ndo2db.

[root@nagios var]# ps -ef | grep ndo2db
nagios    1604     1  0 Apr07 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios    1863  1604  0 Apr07 ?        00:00:10 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios    1864  1863 11 Apr07 ?        02:53:32 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
Notice that the 4th column (C, CPU utilization) is 11 for PID 1864, which is rather high. Killing the ndo2db process and restarting it cleared the queue immediately.
I've enabled debug logging in ndo2db.cfg for now and will hopefully catch whatever is crashing XI soon.

Re: Nagios IPCS stops processing

Posted: Thu Apr 09, 2015 11:28 am
by lmiltchev
Post the debug file, along with the ndo2db.cfg when you are ready (hide sensitive info).

Re: Nagios IPCS stops processing

Posted: Mon Apr 13, 2015 8:57 am
by rseiwert
Still waiting for it to crash again. XI has been working fine lately. We will see if that lasts the week.

Re: Nagios IPCS stops processing

Posted: Mon Apr 13, 2015 9:38 am
by cmerchant
Hoping for two things: that it does keep working, and that if it stops we can catch the elusive ipcs queue bug. Keep us posted. Thanks.

Re: Nagios IPCS stops processing

Posted: Thu Apr 16, 2015 11:23 am
by rseiwert
XI has been stable for a while now. I'm sure I had a check going nuts somewhere, but with XI crashed I couldn't tell what it was.

Could someone please make a feature request for me: the ndo2db check in sysstat.php should verify that the inter-process communication (IPC) message queue is actually being processed. A backed-up IPC queue indicates a hung or choked ndo2db process, for one reason or another. If ndo2db is not processing these messages, then XI is not updating. When I have experienced this issue, the system health in XI continued to show green and reported that ndo2db was running. It seems to me sysstat.php should check that the process is running, that it is actually ndo2db, and that it is functional, via some heartbeat and/or by checking the message queues (ipcs -q). If there are more than, say, 100 messages in the queue, go red. I'm really not sure what an acceptable number of messages is, but I know mine is normally at zero and was over 40 thousand when ndo2db was non-functional.
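
As a rough illustration of the check being requested, here is a minimal sketch that parses `ipcs -q` output in the format shown earlier in this thread and flags queues above a threshold. The function names are made up for this example, the threshold of 100 is only the guess from this post, and this is not how sysstat.php is implemented:

```python
import subprocess


def parse_ipcs_queues(text):
    """Parse the table printed by `ipcs -q` into a list of dicts."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a hex key such as 0x67000002;
        # the banner, the column header and blank lines are skipped.
        if len(fields) == 6 and fields[0].startswith("0x"):
            queues.append({
                "key": fields[0],
                "msqid": int(fields[1]),
                "owner": fields[2],
                "perms": fields[3],
                "used_bytes": int(fields[4]),
                "messages": int(fields[5]),
            })
    return queues


def backed_up_queues(text, owner="nagios", max_messages=100):
    """Return the queues owned by `owner` holding more than `max_messages`.

    max_messages=100 is an assumed threshold, not a recommended value.
    """
    return [q for q in parse_ipcs_queues(text)
            if q["owner"] == owner and q["messages"] > max_messages]


def check_local_queues(max_messages=100):
    """Run `ipcs -q` on this host and return any backed-up nagios queues."""
    output = subprocess.run(["ipcs", "-q"], capture_output=True,
                            text=True, check=True).stdout
    return backed_up_queues(output, max_messages=max_messages)
```

A health check could call check_local_queues() and go red whenever the returned list is non-empty. Because it only reads kernel queue statistics, it would still work while XI itself is not updating.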

Before anyone suggests building a Nagios check to watch this, remember that XI doesn't update when this is down. This is why sysstat.php needs to check for it.

Feel free to close this.

Re: Nagios IPCS stops processing

Posted: Thu Apr 16, 2015 11:27 am
by rseiwert
Actually, could someone post this as a feature request?

Re: Nagios IPCS stops processing

Posted: Thu Apr 16, 2015 11:41 am
by tmcdonald
I can make that request for you.

Are you still seeing or able to reproduce the kernel queue filling up? I think I might have a way to peek into the message queue, but I would need to do it on a live system that is exhibiting the behavior and we have not been able to reproduce this in-house.

Re: Nagios IPCS stops processing

Posted: Mon Apr 20, 2015 11:19 am
by rseiwert
It just did it again today. NDODB is choked. Looking at ndodb.debug there is nothing there, but in ndodb.debug.old I did see some checks that seemed to return huge amounts of data. Of course, once in the DB they are truncated, but I wonder what happens to them in the processing queue. I'm looking at the application log check to see if I can limit or truncate the results. I'm also going to try to look at ndo2db and see what I can see. Leaving the system crashed for now to troubleshoot, but I will need to restart soon.

[root@nagios libexec]# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x67000002 0          nagios     600        0            0
0xa5000002 65537      nagios     600        21795840     21285

[root@nagios libexec]# ps -ef | grep ndo2db | grep -v grep
nagios   50309 61487  0 Apr16 ?        00:00:44 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios   50310 50309  2 Apr16 ?        02:21:43 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios   61487     1  0 Apr08 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
Of course, as I was writing this it managed to get past its choke point and continue on.
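
On the idea of limiting or truncating oversized check results before they reach the queue, one possible approach is a small wrapper around the plugin. This is only a sketch: the wrapper, its function names and the 4096-byte limit are assumptions for illustration, not anything provided by Nagios or ndo2db.

```python
import subprocess


def truncate_output(text, max_bytes=4096, marker=" [truncated]"):
    """Cut text to at most max_bytes of UTF-8 and append a marker if it was cut."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character split by the cut.
    return data[:max_bytes].decode("utf-8", errors="ignore") + marker


def run_plugin(argv, max_bytes=4096):
    """Run a check plugin and return (exit_code, truncated_stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, truncate_output(proc.stdout, max_bytes)
```

A wrapper script built on run_plugin() would print the truncated text and exit with the plugin's own return code, so the check state is preserved while the volume of data handed on is capped.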