Re: NDO2DB Issue out of the blue
Posted: Tue Aug 25, 2015 10:00 am
by rseiwert
jfrickson wrote:
rseiwert wrote: I'm wondering, if you set the NDO2DEBUG directive and it does not daemonize, doesn't that stop it from using IPC and cause everything to run in one process?
It does run in just one process, but it still uses IPC.
With one process what's it using inter-process communication for then?
Re: NDO2DB Issue out of the blue
Posted: Tue Aug 25, 2015 10:12 am
by tmcdonald
rseiwert wrote:With one process what's it using inter-process communication for then?
I do believe that's for receiving information from the nagios process, via ndomod (the NEB module that ships off data from nagios to ndo2db). I could be wrong, or we could be talking about different things. IPC being a broad term, are you referring to the kernel message queue?
Re: NDO2DB Issue out of the blue
Posted: Tue Aug 25, 2015 12:46 pm
by BanditBBS
It is doing it so often now that I have resorted to putting in an event handler to restart ndo2db whenever it sees the queue numbers starting to climb.

This definitely isn't a good solution, as we're losing all those messages.
Re: NDO2DB Issue out of the blue
Posted: Tue Aug 25, 2015 5:08 pm
by tmcdonald
I really wish I had a better answer for you, and trust me when I say we're all stressing over this one.
We've not heard back about a permanent fix, and each new patch we get only works part of the time or makes things worse. All I can think to ask is what has changed in your system? We definitely believe it to be related to a certain check's output, and I had suspected WMI in the past but disabling all WMI checks did not solve anything. If you can come up with a list of things that changed in the last 2 weeks we can work off of that. I don't have much more unfortunately, and I hate having to give that answer.

Re: NDO2DB Issue out of the blue
Posted: Tue Aug 25, 2015 5:24 pm
by BanditBBS
tmcdonald wrote:I really wish I had a better answer for you, and trust me when I say we're all stressing over this one.
We've not heard back about a permanent fix, and each new patch we get only works part of the time or makes things worse. All I can think to ask is what has changed in your system? We definitely believe it to be related to a certain check's output, and I had suspected WMI in the past but disabling all WMI checks did not solve anything. If you can come up with a list of things that changed in the last 2 weeks we can work off of that. I don't have much more unfortunately, and I hate having to give that answer.

Well...we make changes all over the place, but I guess I could sort the services by the ID column and see which ones may have been just added and go from there. Let me work on that and I'll update.
EDIT: I looked through all the services added during the month of August. They are all already monitored on other hosts and there is nothing special about any of them, nothing I can think of that could be throwing anything odd.
Re: NDO2DB Issue out of the blue
Posted: Tue Aug 25, 2015 6:13 pm
by rseiwert
tmcdonald wrote:IPC being a broad term, are you referring to the kernel message queue?
When I said IPC I was referring to the System V InterProcess Communication System which is viewed with the ipcs command.
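For anyone who wants to watch those queues from a script rather than by eye, here is a minimal sketch that parses `ipcs -q` output. It assumes the usual util-linux column layout (key, msqid, owner, perms, used-bytes, messages); the function names and the sample text are made up for illustration and are not taken from any system in this thread.

```python
import subprocess


def parse_ipcs_queues(text):
    """Parse `ipcs -q` output into a list of dicts, one per message queue.

    Assumes the util-linux layout: key msqid owner perms used-bytes messages.
    """
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six fields and a hexadecimal key; skip headers and blanks.
        if len(fields) != 6 or not fields[0].startswith("0x"):
            continue
        key, msqid, owner, perms, used_bytes, messages = fields
        queues.append({
            "key": key,
            "msqid": int(msqid),
            "owner": owner,
            "perms": perms,
            "used_bytes": int(used_bytes),
            "messages": int(messages),
        })
    return queues


def read_queues():
    """Run `ipcs -q` on the local machine and parse its output."""
    out = subprocess.run(["ipcs", "-q"], capture_output=True, text=True, check=True)
    return parse_ipcs_queues(out.stdout)


# Illustrative sample only, not output from the system discussed in this thread.
SAMPLE = """
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x4b010002 0          nagios     600        131072       128
"""
```

Polling `read_queues()` and comparing the `messages` count between runs is one way to notice the queue starting to climb, which is the condition the restart event handler mentioned earlier in the thread reacts to.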
Re: NDO2DB Issue out of the blue
Posted: Tue Aug 25, 2015 6:24 pm
by rseiwert
Looking through this thread, I don't see any ndo2db.debug log.
Did you set the following in /usr/local/nagios/etc/ndo2db.cfg?
Code:
# DEBUG LEVEL
# This option determines how much (if any) debugging information will
# be written to the debug file. OR values together to log multiple
# types of information.
# Values: -1 = Everything
#          0 = Nothing
#          1 = Process info
#          2 = SQL queries
debug_level=-1

# DEBUG VERBOSITY
# This option determines how verbose the debug log out will be.
# Values: 0 = Brief output
#         1 = More detailed
#         2 = Very detailed
debug_verbosity=2

# DEBUG FILE
# This option determines where the daemon should write debugging information.
debug_file=/usr/local/nagios/var/ndo2db.debug
If so, what is in the ndo2db.debug file? Have you tried sorting this file by line length? What is in the longest line? Having experienced this problem for several weeks, I would like to see this solved as well. In my case, I noticed that the output of the check that was actually causing the problem contained the check data for another check. Have you noticed any checks that contain some other check's data?
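Sorting by line length can be done without loading the whole debug log into memory. A minimal sketch, assuming the log is plain text with one entry per line; the function name `longest_lines` is hypothetical.

```python
import heapq


def longest_lines(path, count=5):
    """Return the `count` longest lines of a file as (length, line_number, text).

    Results are ordered longest first. Undecodable bytes are replaced so that
    odd characters in check output do not abort the scan.
    """
    with open(path, "r", errors="replace") as handle:
        measured = (
            (len(line.rstrip("\n")), number, line.rstrip("\n"))
            for number, line in enumerate(handle, start=1)
        )
        # nlargest consumes the generator while the file is still open.
        return heapq.nlargest(count, measured)
```

Calling `longest_lines("/usr/local/nagios/var/ndo2db.debug")` with the debug_file path from the config above would return the five longest entries along with their line numbers, which makes it easy to jump to them in the log.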
Re: NDO2DB Issue out of the blue
Posted: Wed Aug 26, 2015 8:31 am
by BanditBBS
I have it doing the debug log now so I can check for what you asked. I'll let you know.
Interesting note: this is the 3rd day in a row where it crashed multiple times in the morning, specifically at least once between 8:00 and 8:10am. The nice thing is that I have an event handler in place to restart it now and then send me an SMS so I know it got restarted. Good band-aid for now, but yeah, this definitely needs to be fixed; there is no telling how much information we're losing.
Re: NDO2DB Issue out of the blue
Posted: Wed Aug 26, 2015 9:06 am
by tmcdonald
One of the thoughts was that there is a certain combination of message length, newline placement, and presence of a delimiter (= or : if I recall correctly) that causes ndo to enter an infinite loop parsing one specific message, though we have not narrowed down exactly what that combination is. You say that between 8:00 and 8:10 this happens, is that consistent? There may be a backup, scan, or other scheduled event running on a remote machine that causes a check to return output that matches those criteria. Can you think of anything on your monitored machines that would do this?
Re: NDO2DB Issue out of the blue
Posted: Wed Aug 26, 2015 9:16 am
by BanditBBS
tmcdonald wrote:One of the thoughts was that there is a certain combination of message length, newline placement, and presence of a delimiter (= or : if I recall correctly) that causes ndo to enter an infinite loop parsing one specific message, though we have not narrowed down exactly what that combination is. You say that between 8:00 and 8:10 this happens, is that consistent? There may be a backup, scan, or other scheduled event running on a remote machine that causes a check to return output that matches those criteria. Can you think of anything on your monitored machines that would do this?
We have lots of checks that use those delimiters, so yeah, it could be that, and there is also plenty of long output as well. I was thinking of maybe doing this:
Code:
echo "
alter table nagios_servicestatus modify output varchar(65535) not null, modify long_output varchar(65535) not null, modify perfdata varchar(65535) not null;
alter table nagios_hoststatus modify output varchar(65535) not null, modify long_output varchar(65535) not null, modify perfdata varchar(65535) not null;
alter table nagios_servicechecks modify output varchar(65535) not null, modify long_output varchar(65535) not null, modify perfdata varchar(65535) not null;
alter table nagios_hostchecks modify output varchar(65535) not null, modify long_output varchar(65535) not null, modify perfdata varchar(65535) not null;
" | mysql -pnagiosxi nagios
That would be to see if it makes any difference. What do you think of me trying that? And instead of 65535, the default and current is 255, right? Any issue with me just setting it to 1024?
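To make it easier to try different lengths, the same four statements can be generated from a single number. This is only a sketch: the helper name `build_alter_statements` is hypothetical, the table and column names are copied from the statements above, and the script only produces the SQL text. It does not check the current column types or whether MySQL's row-size limit accepts the chosen length.

```python
# Table and column names are the ones used in the statements above.
TABLES = (
    "nagios_servicestatus",
    "nagios_hoststatus",
    "nagios_servicechecks",
    "nagios_hostchecks",
)
COLUMNS = ("output", "long_output", "perfdata")


def build_alter_statements(length):
    """Return one ALTER TABLE statement per table for the given VARCHAR length."""
    if not 1 <= length <= 65535:
        raise ValueError("length must be between 1 and 65535")
    statements = []
    for table in TABLES:
        changes = ", ".join(
            f"modify {column} varchar({length}) not null" for column in COLUMNS
        )
        statements.append(f"alter table {table} {changes};")
    return statements
```

Printing `"\n".join(build_alter_statements(1024))` gives the 1024 variant, which can be piped to the mysql client the same way as the echo above.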