I am still having an issue where XI stops updating but Core keeps functioning. If the message queues keep growing, which process is supposed to be processing these messages?
Usually ndo2db. I have been trying to figure out a good way to view what is in those messages in the queue. If you know of a good way to do so, do tell.
Notice that the 4th column is 11, which is rather high. Killing the ndo2db process and restarting it cleared the queue immediately.
I have enabled debug logging in ndo2db.cfg for now and will hopefully catch whatever is crashing XI soon.
XI has been stable for a while now. I'm sure I had a check going nuts somewhere, but with XI crashed I couldn't tell which one it was.
Could someone please file a feature request for me: the ndo2db check in sysstat.php should verify that the inter-process communication (IPC) message queue is actually being processed. A backed-up IPC queue indicates a hung or choked ndo2db process, for one reason or another. If ndo2db is not processing these messages, XI is not updating. When I have experienced this issue, the system health in XI continued to show green and reported ndo2db as running. It seems to me sysstat.php should check that the process is running, that it really is ndo2db, and that it is functional, either via some heartbeat and/or by checking the message queues (ipcs -q). If there are more than, say, 100 messages in the queue, go red. I'm really not sure what an acceptable number of messages is, but I know mine is normally at zero, and when ndo2db was non-functional it was over 40 thousand.
Before anyone suggests building a Nagios check to watch this, remember that XI doesn't update when ndo2db is down. That is why sysstat.php needs to check for this.
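For anyone who wants to try the queue-depth idea before it lands in sysstat.php, here is a rough Python sketch. It is not how XI does anything; it just parses the text that `ipcs -q` prints on Linux (columns key, msqid, owner, perms, used-bytes, messages) and flags queues above a threshold. The function names and the default of 100 are my own placeholders (100 is only the guess from the post above), and it is meant to run outside Nagios, for example from cron, since XI itself is not updating when this happens.

```python
import subprocess


def parse_ipcs_queues(text):
    """Parse `ipcs -q` output into a list of (msqid, owner, messages)."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and begin with a hex key such as 0x00000000;
        # the banner and header lines do not, so they are skipped.
        if len(fields) == 6 and fields[0].startswith("0x"):
            queues.append((int(fields[1]), fields[2], int(fields[5])))
    return queues


def backed_up(queues, threshold=100):
    """Return the queues holding more than `threshold` messages.

    The default threshold is an assumption, not a recommended value.
    """
    return [q for q in queues if q[2] > threshold]


def check_live_system(threshold=100):
    """Run `ipcs -q` on this host and return the backed-up queues."""
    out = subprocess.run(
        ["ipcs", "-q"], capture_output=True, text=True, check=True
    ).stdout
    return backed_up(parse_ipcs_queues(out), threshold)
```

Calling `check_live_system()` returns an empty list when every queue is at or below the threshold. A cron wrapper could send mail or restart ndo2db when the list is non-empty, but that part is left out on purpose.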
Feel free to close this.
Last edited by rseiwert on Thu Apr 16, 2015 11:29 am, edited 1 time in total.
Are you still seeing or able to reproduce the kernel queue filling up? I think I might have a way to peek into the message queue, but I would need to do it on a live system that is exhibiting the behavior and we have not been able to reproduce this in-house.
Just did it again today. ndo2db is choked. Looking at ndodb.debug there is nothing there, but in ndodb.debug.old I did see some checks that seemed to return huge amounts of data. Of course, once in the DB they are truncated, but I wonder what happens to them in the processing queue. I'm looking at the application log check to see if I can limit or truncate the results. I'm also going to look at ndo2db and see what I can find. I'm leaving the system crashed for now to troubleshoot, but will need to restart soon.
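One way to limit the results, sketched in Python: wrap the noisy check so its output is cut down before Nagios ever sees it. This is only an illustration under my own assumptions. The 4096-character limit is arbitrary, `truncate_output` and `run_check` are made-up names, and whether oversized output is really what chokes ndo2db is still unconfirmed.

```python
import subprocess


def truncate_output(text, limit=4096):
    """Keep the first `limit` characters of text and mark the cut if one was made."""
    if len(text) <= limit:
        return text
    return text[:limit] + " [truncated]"


def run_check(argv, limit=4096):
    """Run a plugin command; return (exit_code, truncated stdout).

    The exit code is passed through unchanged so the check state is preserved.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, truncate_output(proc.stdout, limit)
```

A wrapper script would print the returned text and exit with the returned code, so the check still reports the same state, just with less data.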