Re: Nagios IPCS stops processing
Posted: Mon Apr 20, 2015 1:33 pm
by tgriep
Do you still have access to the ndodb.debug.old log file with the errors? Could you post it to the forum so we can see it?
Re: Nagios IPCS stops processing
Posted: Mon Apr 20, 2015 5:46 pm
by rseiwert
Even with ndo2db logging set to everything and debug set to verbose, all that is in the log is the insert statements. I did have an Exchange server going nuts with about 500 errored application log events/hour most of the day. I do have check_wmi_plus checking this, and it does return a list of events in the body of the check. Typically, since this reports only the last hour, the list is small, and when the results are put through the db they are truncated. I'm assuming that these large check returns were choking ndo2db. I know I should not return large checks, but sometimes things do go askew, and I would argue we need to harden the system against such chaos. It should NOT simply stop processing and keep showing everything is OK. I just got done wrapping check_wmi_plus in a shell script to truncate its output before Nagios consumes it. I don't know for sure if that was the issue. Of course, by the time I knocked the rust off my shell scripting and got it done, whatever was causing this issue was also gone.
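For anyone wanting to try the same workaround, a wrapper along these lines is one way to do it. This is only a minimal sketch, not the actual script used here: the function name run_truncated, the byte limit, and the plugin path in the usage note are all assumptions. It captures the plugin's output, caps it with head -c, and hands back the plugin's original exit code so the check state is unchanged.

```shell
#!/bin/sh
# Hypothetical helper (not from check_wmi_plus or Nagios itself):
# run_truncated LIMIT COMMAND [ARGS...]
# Runs COMMAND, prints at most LIMIT bytes of its combined output,
# and returns COMMAND's own exit status so the check state survives.
run_truncated() {
    rt_limit=$1
    shift
    rt_output=$("$@" 2>&1)
    rt_status=$?
    # head -c is not strictly POSIX but is available in GNU and BSD head.
    printf '%s\n' "$rt_output" | head -c "$rt_limit"
    return "$rt_status"
}
```

In a real wrapper script the last lines would be something like `run_truncated 4096 /path/to/check_wmi_plus.pl "$@"` followed by `exit $?`, with the path and the 4096-byte limit adjusted to the local install. Note that a byte cut can land in the middle of a line, so any performance data at the end of the output may be lost.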
Re: Nagios IPCS stops processing
Posted: Tue Apr 21, 2015 2:22 pm
by tgriep
Thanks for the log file. It will help in debugging the issue.
Re: Nagios IPCS stops processing
Posted: Wed Apr 22, 2015 5:07 pm
by rseiwert
I have been able to replicate this by recreating the issue. I simply opened up the check to show all event logs for the last couple of days and BAM! started getting queue messages and invalid check results.
I do know that I shouldn't return that much data but I also feel that a little GIGO checking will go a long way to improving system stability.
Re: Nagios IPCS stops processing
Posted: Thu Apr 23, 2015 11:48 am
by rseiwert
Someone (tmcdonald) mentioned that they had a way of inspecting these messages. I would like to know how, so that it could be used while the problem is occurring. Also, can anyone tell me more about the NDO processes, like why there are three and what each one does?
In researching this issue I found this in the NDOUtils README:
Code:
***************
!! IMPORTANT !!
***************
This code is still an alpha/beta quality, so expect problems if you intend to use
it. Make sure that you aren't using it with your only production installation of
Nagios, or it could take down the Nagios process if the NDOMOD module segfaults.
Nagios could segfault silently and you might never know that Nagios crashed...
And later in the document:
Code:
ndomod-2x.o = NDOMOD module for Nagios 2.x
ndomod-3x.o = NDOMOD module for Nagios 3.x
ndomod-4x.o = NDOMOD module for Nagios 4.x (unstable)
Some IPCS stuff I found out: if you stop and start, or crash and restart, the old message queue will persist. You can use ipcs -q -p to figure out which queue is relevant, which process ID is pitching (lspid, the last sender), and which one is catching (lrpid, the last receiver). I'm beginning to think that it is MySQL that is blocking the third NDO2DB process, which is backing up the queue.
Code:
[root@nagios ~]# ps -ef | grep ndo2db | grep -v grep
nagios 49744 1 0 11:10 ? 00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios 49778 49744 0 11:10 ? 00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios 49779 49778 0 11:10 ? 00:00:08 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
[root@nagios ~]# ipcs -q -p
------ Message Queues PIDs --------
msqid owner lspid lrpid
131072 nagios 8348 8349
163841 nagios 49778 49779
[root@nagios ~]# ipcs -q -i 163841
Message Queue msqid=163841
uid=500 gid=500 cuid=500 cgid=500 mode=0600
cbytes=0 qbytes=131072000 qnum=0 lspid=49778 lrpid=49779
send_time=Thu Apr 23 12:41:51 2015
rcv_time=Thu Apr 23 12:41:51 2015
change_time=Thu Apr 23 11:10:38 2015
[root@nagios ~]# ipcs -q -t
------ Message Queues Send/Recv/Change Times --------
msqid owner send recv change
131072 nagios Apr 23 11:04:52 Apr 23 11:04:52 Apr 22 20:17:29
163841 nagios Apr 23 12:41:59 Apr 23 12:41:59 Apr 23 11:10:38
[root@nagios ~]# ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
0xbc000002 131072 nagios 600 0 0
0x92000078 163841 nagios 600 0 0
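The PID matching described above can be sketched as a small shell function. This is only an illustration: the name stale_queues is made up, and it assumes the four-column `ipcs -q -p` layout shown in this post (msqid, owner, lspid, lrpid). It treats a queue as left over when its last receiver PID no longer exists.

```shell
#!/bin/sh
# Hypothetical helper: reads `ipcs -q -p` output on stdin and prints the
# msqid of every queue whose last-receiver PID (lrpid) is no longer running.
# Assumes the column order shown above: msqid owner lspid lrpid.
stale_queues() {
    while read -r sq_msqid sq_owner sq_lspid sq_lrpid; do
        # Skip banner and header lines (anything without a numeric msqid/lrpid).
        case "$sq_msqid" in ''|*[!0-9]*) continue ;; esac
        case "$sq_lrpid" in ''|*[!0-9]*) continue ;; esac
        # lrpid 0 means nothing has read from the queue yet; not judged here.
        [ "$sq_lrpid" -eq 0 ] && continue
        # kill -0 only tests for existence; run as root or the queue owner,
        # otherwise a permission error would look like a dead process.
        kill -0 "$sq_lrpid" 2>/dev/null || echo "$sq_msqid"
    done
}
```

Run it as `ipcs -q -p | stale_queues`. With the output above it would be expected to report 131072 if PID 8349 is gone. A reported queue can then be removed with `ipcrm -q <msqid>`, but only after confirming nothing still uses it.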
Re: Nagios IPCS stops processing
Posted: Thu Apr 23, 2015 12:16 pm
by tmcdonald
Working on some Perl code that can dump the queues, just need to look into the NDO code to figure out the message structure. Will update when I make more progress.
Update:
Run the following to install the correct perl module:
Code:
perl -MCPAN -e 'install IPC::SysV'
then save this as dumpq.pl:
Code:
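#!/usr/bin/perl
use strict; use warnings;
use IPC::SysV;
my $id = shift // die "usage: $0 <msqid>\n";
# Blocks until a message of type 1 arrives. Note that msgrcv removes
# the message from the queue, so the real consumer will never see it.
msgrcv($id, my $msg, 32000, 1, 0) or die "msgrcv failed: $!\n";
# The received buffer starts with the message type as a native long; strip it.
my (undef, $body) = unpack("l! a*", $msg);
print "Message is:\n$body\nEND OF MESSAGE\n";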
Make sure to chmod +x it. Run ipcs -q to get the id of the full queue, then run the perl program like so:
Code:
./dumpq.pl [queue id] > queue_contents.txt
It should (hopefully) write the contents of a single message to queue_contents.txt, and if they are sane we can see what's in the rest of the queue. They might be ASCII or they might be binary, so post the output file once it runs and we'll see.
If it hangs it means you ran it against an empty queue.
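To avoid the hang, one option is to check the queue's message count before dumping. The following is a rough sketch only: queue_has_messages is a made-up name, and it assumes the six-column `ipcs -q` layout shown earlier in the thread (key, msqid, owner, perms, used-bytes, messages).

```shell
#!/bin/sh
# Hypothetical guard: reads `ipcs -q` output on stdin and succeeds only if
# the given msqid is listed and its "messages" column (6th field) is > 0.
queue_has_messages() {
    awk -v id="$1" '$2 == id && $6 > 0 { found = 1 } END { exit !found }'
}
```

It could be used like `ipcs -q | queue_has_messages 163841 && ./dumpq.pl 163841 > queue_contents.txt`. The queue can still drain between the check and the read, so this reduces the chance of a hang without eliminating it.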
Re: Nagios IPCS stops processing
Posted: Thu Apr 23, 2015 1:23 pm
by rseiwert
Hopefully I will not be back to this thread again and the good ship NMS stays upright. If it does I will most certainly share what I find.
Re: Nagios IPCS stops processing
Posted: Thu Apr 23, 2015 5:20 pm
by abrist
rseiwert wrote: Hopefully I will not be back to this thread again and the good ship NMS stays upright.
As do we.
rseiwert wrote: If it does I will most certainly share what I find.
Many thanks as always.
Have a great weekend!