NDO2DB Issue out of the blue

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
rseiwert
Posts: 196
Joined: Wed Jun 22, 2011 10:33 pm
Location: Somewhere between Here and Now

Re: NDO2DB Issue out of the blue

Post by rseiwert »

jfrickson wrote:
rseiwert wrote: I'm wondering: if you set the NDO2DEBUG directive and it does not daemonize, doesn't that stop it from using IPC and cause everything to run in one process?
It does run in just one process, but it still uses IPC.
With only one process, what is it using inter-process communication for, then?
Grumpy Olde IT Guy
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: NDO2DB Issue out of the blue

Post by tmcdonald »

rseiwert wrote: With only one process, what is it using inter-process communication for, then?
I do believe that's for receiving information from the nagios process, via ndomod (the NEB module that ships off data from nagios to ndo2db). I could be wrong, or we could be talking about different things. IPC being a broad term, are you referring to the kernel message queue?
Former Nagios employee
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

It is happening so often now that I have resorted to putting in an event handler to restart ndo2db whenever it sees the queue numbers starting to climb :( This definitely isn't a good solution, as we're losing all those messages.
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: NDO2DB Issue out of the blue

Post by tmcdonald »

I really wish I had a better answer for you, and trust me when I say we're all stressing over this one.

We've not heard back about a permanent fix, and each new patch we get only works part of the time or makes things worse. All I can think to ask is what has changed in your system? We definitely believe it to be related to a certain check's output, and I had suspected WMI in the past but disabling all WMI checks did not solve anything. If you can come up with a list of things in the last 2 weeks we can work off of that. I don't have much more unfortunately, and I hate having to give that answer :(
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

tmcdonald wrote:I really wish I had a better answer for you, and trust me when I say we're all stressing over this one.

We've not heard back about a permanent fix, and each new patch we get only works part of the time or makes things worse. All I can think to ask is what has changed in your system? We definitely believe it to be related to a certain check's output, and I had suspected WMI in the past but disabling all WMI checks did not solve anything. If you can come up with a list of things in the last 2 weeks we can work off of that. I don't have much more unfortunately, and I hate having to give that answer :(
Well...we make changes all over the place, but I guess I could sort the services by the ID column and see which ones may have been just added and go from there. Let me work on that and I'll update.

EDIT: I looked through all the services added during the month of August. They are all already monitored on other hosts and there is nothing special about any of them. I can't think of anything that could be throwing anything odd.
rseiwert
Posts: 196
Joined: Wed Jun 22, 2011 10:33 pm
Location: Somewhere between Here and Now

Re: NDO2DB Issue out of the blue

Post by rseiwert »

tmcdonald wrote:IPC being a broad term, are you referring to the kernel message queue?
When I said IPC, I was referring to the System V inter-process communication facilities, whose status can be viewed with the ipcs command.
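For anyone who wants to watch the queue numbers from a script, here is a minimal sketch of parsing the message queue table that `ipcs -q` prints. It assumes the six-column layout of the common Linux ipcs (key, msqid, owner, perms, used-bytes, messages). The sample text, the function names, and the 1000-message threshold are all made up for illustration and are not taken from any real system.

```python
def parse_ipcs_queues(text):
    """Parse the table printed by `ipcs -q` into a list of dicts."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a hex key (0x...) and have six columns:
        # key, msqid, owner, perms, used-bytes, messages.
        # Banner and header lines do not match and are skipped.
        if len(fields) == 6 and fields[0].startswith("0x"):
            queues.append({
                "key": fields[0],
                "msqid": int(fields[1]),
                "owner": fields[2],
                "perms": fields[3],
                "used_bytes": int(fields[4]),
                "messages": int(fields[5]),
            })
    return queues


def backed_up(queues, max_messages=1000):
    """Return the queues holding more than max_messages (arbitrary threshold)."""
    return [q for q in queues if q["messages"] > max_messages]


# Illustrative sample only; real keys, owners, and counts will differ.
SAMPLE = """
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x4e010002 32768      nagios     600        131072       128
0x00000000 65537      root       600        0            0
"""

for queue in parse_ipcs_queues(SAMPLE):
    print(queue["msqid"], queue["owner"], queue["messages"])
```

On a live system the text would come from running `ipcs -q` (for example through `subprocess.run`) instead of the sample string.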
rseiwert
Posts: 196
Joined: Wed Jun 22, 2011 10:33 pm
Location: Somewhere between Here and Now

Re: NDO2DB Issue out of the blue

Post by rseiwert »

Looking through this thread, I don't see any ndo2db.debug log.

Did you set the following in /usr/local/nagios/etc/ndo2db.cfg?

Code:

# DEBUG LEVEL
# This option determines how much (if any) debugging information will
# be written to the debug file.  OR values together to log multiple
# types of information.
# Values: -1 = Everything
#          0 = Nothing
#          1 = Process info
#          2 = SQL queries
debug_level=-1

# DEBUG VERBOSITY
# This option determines how verbose the debug log out will be.
# Values: 0 = Brief output
#         1 = More detailed
#         2 = Very detailed
debug_verbosity=2

# DEBUG FILE
# This option determines where the daemon should write debugging information.
debug_file=/usr/local/nagios/var/ndo2db.debug
If so, what is in the ndo2db.debug file? Have you tried sorting this file by line length? What is in the longest line? Having experienced this problem for several weeks, I would like to see this solved as well. In my case, I noticed that the output of the check that was actually causing the problem contained data from another check. Have you noticed any checks whose output contains some other check's data?
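One way to find the longest lines without sorting the whole file is sketched below. This is only an illustration: the function name and the default of five lines are arbitrary choices, and it makes no assumption about what the debug log lines look like.

```python
import heapq


def longest_lines(path, count=5):
    """Return (length, line_number, text) tuples for the longest lines, longest first."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, "r", errors="replace") as handle:
        measured = (
            (len(line.rstrip("\n")), number, line.rstrip("\n"))
            for number, line in enumerate(handle, start=1)
        )
        # nlargest reads the generator fully before the file is closed.
        return heapq.nlargest(count, measured)


# Example use against the debug file configured above:
# for length, number, text in longest_lines("/usr/local/nagios/var/ndo2db.debug"):
#     print(length, number, text[:200])
```

Printing only the first 200 characters of each hit keeps the terminal readable when a line turns out to be very long.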
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

I have it doing the debug log now so I can check for what you asked. I'll let you know.

Interesting note: this is the 3rd day in a row where it crashed multiple times in the morning, with at least one crash between 8:00 and 8:10am. The nice thing is that I now have an event handler in place that restarts it and then sends me an SMS, so I know it got restarted. It's a good band-aid for now, but yeah, this definitely needs to be fixed; there's no telling how much information we're losing.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: NDO2DB Issue out of the blue

Post by tmcdonald »

One of the thoughts was that there is a certain combination of message length, newline placement, and presence of a delimiter (= or : if I recall correctly) that causes ndo to enter an infinite loop parsing one specific message, though we have not narrowed down exactly what that combination is. You say that between 8:00 and 8:10 this happens, is that consistent? There may be a backup, scan, or other scheduled event running on a remote machine that causes a check to return output that matches those criteria. Can you think of anything on your monitored machines that would do this?
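To narrow down which checks are worth a closer look, the hypothesis above can be turned into a rough filter. This is only a sketch of the three criteria as described (long output, a newline, and an = or : delimiter). The exact triggering combination is not known, so the 255-character cutoff and the function names are arbitrary assumptions, and a match only marks a candidate, not a culprit.

```python
DELIMITERS = ("=", ":")


def is_suspect(output, min_length=255):
    """True when output is long, spans several lines, and contains a delimiter."""
    return (
        len(output) > min_length          # "message length" (cutoff is a guess)
        and "\n" in output                # "newline placement" (presence only)
        and any(d in output for d in DELIMITERS)
    )


def suspects(outputs, min_length=255):
    """Return the names of checks whose output matches all three criteria."""
    return [name for name, text in outputs.items() if is_suspect(text, min_length)]


# Hypothetical check outputs keyed by a made-up service name.
sample = {
    "ping": "OK",
    "disk": "disk=" + "9" * 300 + "\nmore: data",
    "blob": "x" * 400,
}
print(suspects(sample))
```

Feeding it the outputs captured around the 8:00 to 8:10 window would be one way to shorten the list of checks to review by hand.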
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

tmcdonald wrote:One of the thoughts was that there is a certain combination of message length, newline placement, and presence of a delimiter (= or : if I recall correctly) that causes ndo to enter an infinite loop parsing one specific message, though we have not narrowed down exactly what that combination is. You say that between 8:00 and 8:10 this happens, is that consistent? There may be a backup, scan, or other scheduled event running on a remote machine that causes a check to return output that matches those criteria. Can you think of anything on your monitored machines that would do this?
We have lots of checks that use those delimiters, so yeah, it could be that, and we also have plenty of long output as well. I was thinking of maybe doing this:

Code:

echo "
alter table nagios_servicestatus modify output varchar(65535) not null, modify long_output varchar(65535) not null, modify perfdata varchar(65535) not null;
alter table nagios_hoststatus modify output varchar(65535) not null, modify long_output varchar(65535) not null, modify perfdata varchar(65535) not null;
alter table nagios_servicechecks modify output varchar(65535) not null, modify long_output varchar(65535) not null, modify perfdata varchar(65535) not null;
alter table nagios_hostchecks modify output varchar(65535) not null, modify long_output varchar(65535) not null, modify perfdata varchar(65535) not null;
" | mysql -pnagiosxi nagios
The idea is to see if that makes any difference. What do you think of me trying that? Also, the default and current length is 255, right? Instead of 65535, is there any issue with me just setting it to 1024?
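A quick back-of-the-envelope check on those sizes, as a sketch only. MySQL caps a row at 65,535 bytes across all of its non-BLOB/TEXT columns, and a VARCHAR stores a 1- or 2-byte length prefix on top of its data. The estimate below counts just the three columns being modified and ignores every other column in the table, so a "fits" result is necessary but not sufficient. The bytes-per-character values are assumptions about the table's character set.

```python
MYSQL_MAX_ROW_BYTES = 65535  # row-size limit shared by all non-BLOB/TEXT columns


def varchar_bytes(chars, bytes_per_char=1):
    """Worst-case storage for one VARCHAR(chars) column, including its length prefix."""
    data = chars * bytes_per_char
    prefix = 1 if data <= 255 else 2
    return data + prefix


def columns_fit(chars, columns=3, bytes_per_char=1):
    """True if `columns` VARCHAR(chars) columns alone stay within the row limit."""
    return columns * varchar_bytes(chars, bytes_per_char) <= MYSQL_MAX_ROW_BYTES


for size in (255, 1024, 65535):
    print(size, columns_fit(size), columns_fit(size, bytes_per_char=3))
```

By this estimate three varchar(65535) columns exceed the row limit on their own, while three varchar(1024) columns stay far below it even at 3 bytes per character.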