NDO2DB Issue out of the blue

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: NDO2DB Issue out of the blue

Post by Box293 »

When running this:

Code: Select all

ipcs -q
Do you see more than one nagios queue?

Try this:

Code: Select all

service nagios stop
service ndo2db stop
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
service ndo2db start
service nagios start
ipcs -q
Do you only see one queue?
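To make the "more than one queue" check repeatable, the count can be scripted. This is only a minimal sketch, assuming the `ipcs -q` column layout shown in this thread (owner in the third column); the function name `count_nagios_queues` is made up for illustration.

```shell
# Count message queues owned by the nagios user.
# Reads `ipcs -q` output on stdin; assumes column 3 is the owner
# and that data rows start with a hex key such as 0xee070002.
count_nagios_queues() {
    awk '$3 == "nagios" && $1 ~ /^0x/ { n++ } END { print n + 0 }'
}

# Usage (on the Nagios server):
#   ipcs -q | count_nagios_queues
```

Anything above 1 after the cleanup would suggest stale queues are still around.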
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

Yes, like 10 or so. Ran your commands and now there is just the one. It is taking forever for XI to get back to the ~16646 services count. Before I ran your commands the damn thing kept stopping at 7386 services and would just sit there and hang my NDO. Restarting NDO would get it moving again, but very slowly... I'm up to 15253 services, so almost there. Right now everything in XI shows as scheduled but not run yet, even though it is running checks and the numbers in XI are updating.
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: NDO2DB Issue out of the blue

Post by Box293 »

Yeah, I don't understand why it decides to create more queues; the other queues never seemed to get processed.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

At 4am it seemed to break. Our service desk contacted my Indian employee and he initiated an "Apply Changes" and that got everything working. I woke up this morning to everything working, however, look at this screenshot:
Capture.PNG
Why does XI think Nagios is not running? It clearly is, as you can see from the numbers shown in the checks performed area. Everything is working so well right now that I don't want to touch it until it breaks again, lol.
rseiwert
Posts: 196
Joined: Wed Jun 22, 2011 10:33 pm
Location: Somewhere between Here and Now

Re: NDO2DB Issue out of the blue

Post by rseiwert »

Sorry I didn't post sooner, but this seems very similar to something that was happening to me. In my case I had a check returning way too much data and it was crashing NDO2DB. There was nothing that pointed at the bad check either. The other thing that complicated the issue is that the system health checks and the init.d scripts are lazy. They keep the PID in a file, and if that process crashes and the PID is reused, Nagios thinks the service is up. What's even worse, if you do a restart on the service it will kill off whatever other process now has that PID, which may be a critical service.
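The stale PID file problem can be checked for by comparing the name of the process behind the recorded PID with the daemon you expect. A rough sketch, assuming a Linux host; the function name `pid_matches_name` is hypothetical and the PID file path is whatever your ndo2db configuration uses (it is not taken from this thread).

```shell
# Return 0 only if the PID stored in a PID file belongs to a process
# with the expected command name. Guards against PID reuse after a crash.
#   $1 = path to the PID file (site-specific, passed in by the caller)
#   $2 = expected command name, e.g. ndo2db
pid_matches_name() {
    pmn_file=$1
    pmn_expected=$2
    [ -r "$pmn_file" ] || return 1
    pmn_pid=$(cat "$pmn_file")
    # Reject empty or non-numeric content.
    case $pmn_pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    if [ -r "/proc/$pmn_pid/comm" ]; then
        pmn_actual=$(cat "/proc/$pmn_pid/comm")
    else
        # Fallback when /proc is not available.
        pmn_actual=$(ps -p "$pmn_pid" -o comm= 2>/dev/null)
    fi
    [ "$pmn_actual" = "$pmn_expected" ]
}

# Usage (path is an example only):
#   pid_matches_name /path/to/ndo2db.pid ndo2db || echo "PID file is stale"
```

Running a check like this before a restart avoids signalling an unrelated process that happened to inherit the old PID.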

https://support.nagios.com/forum/viewto ... 16&t=32516

https://support.nagios.com/forum/viewto ... 16&t=32290

https://support.nagios.com/forum/viewto ... 16&t=32206

The NDO Utils README states:
***************
!! IMPORTANT !!
***************
This code is still an alpha/beta quality, so expect problems if you intend to use
it. Make sure that you aren't using it with your only production installation of
Nagios, or it could take down the Nagios process if the NDOMOD module segfaults.
Nagios could segfault silently and you might never know that Nagios crashed...
Grumpy Olde IT Guy
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: NDO2DB Issue out of the blue

Post by tmcdonald »

I know we had some issues with large output before. @BanditBBS, did you ever modify the SQL tables to hold more than the 1024 (or whatever the default was) characters?
Former Nagios employee
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

tmcdonald wrote:I know we had some issues with large output before. @BanditBBS, did you ever modify the SQL tables to hold more than the 1024 (or whatever the default was) characters?
Yeah, I do remember doing that, but that was for displaying it on the screen in XI. I never redid that change when I migrated to the new server.

Update on issue: Things have been running fine now for 5.5 hours (a record for the past week). The change I made: disabled auto rescheduling. I'm waiting for a solid day of no issues before declaring it "fixed", but I would be concerned if that feature is causing issues yet again.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

Ugh, as soon as I posted that, NDO broke:

Code: Select all

[root@iss-chi-nag09 ~]# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xee070002 1409024    nagios     600        88145920     86080
Restarted NDO2DB and now I have this:

Code: Select all

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xee070002 1409024    nagios     600        100672512    98313
0x50070002 1441793    nagios     600        0            0
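The growing `messages` count on the first queue is the symptom to watch. A small sketch for flagging it automatically, assuming the same `ipcs -q` column layout as above (owner in column 3, message count in column 6); the function name `queue_backlog` and the default threshold of 1000 are arbitrary choices for illustration.

```shell
# Print "msqid messages" for every nagios-owned queue whose pending
# message count exceeds a threshold. Reads `ipcs -q` output on stdin.
#   $1 = threshold (defaults to 1000, an arbitrary illustrative value)
queue_backlog() {
    awk -v limit="${1:-1000}" '
        $3 == "nagios" && $1 ~ /^0x/ && $6 + 0 > limit + 0 { print $2, $6 }
    '
}

# Usage (on the Nagios server):
#   ipcs -q | queue_backlog 50000
```

Any output means a queue is filling faster than it is being drained, which matches what the listing above shows for msqid 1409024.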
rseiwert
Posts: 196
Joined: Wed Jun 22, 2011 10:33 pm
Location: Somewhere between Here and Now

Re: NDO2DB Issue out of the blue

Post by rseiwert »

It's not (at least in my case) the size of the fields in the SQL table. The check results are passed into NDO2DB, and NDO uses IPC to pass them to its worker process. If the check output is larger than the buffer it allocates, it crashes. If you are running checkwmiplus: I modified checkwmiplus so it cannot overflow NDO. The problem is that the check causing the overflow is not seen or reported by Nagios, so you wouldn't know. In my case I was listing the application event log errors over the last hour. I had a mail server starting to throw crazy errors, which created a huge check result. NDO takes this and passes it along without checking for a buffer overrun. This of course corrupts the IPC queue. I created a check with a large text section to replicate it or, as someone found out, just running checkwmiplus in debug mode will replicate the issue. Debug mode on NDOUtils is also useless in troubleshooting this.
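One generic way to keep an oversized check result from reaching NDO is to wrap the plugin and cap its output. This is not the checkwmiplus modification mentioned above, just a sketch of the same idea; the function name `cap_output` is hypothetical, and the byte limit you pass is your own guess, since the actual NDO buffer size is not stated in this thread.

```shell
# Run a plugin command, truncate its combined output to a maximum
# number of bytes, and preserve the plugin's exit code for Nagios.
#   $1   = maximum number of output bytes (site-chosen, an assumption)
#   $2.. = the plugin command and its arguments
cap_output() {
    cap_max=$1
    shift
    cap_out=$("$@" 2>&1)
    cap_rc=$?
    # head -c cuts at a byte boundary, so a multi-byte character
    # at the cut point may be split.
    printf '%s\n' "$cap_out" | head -c "$cap_max"
    return "$cap_rc"
}

# Usage (plugin path and limit are examples only):
#   cap_output 4096 /path/to/plugin --some-args
```

Because the exit code is passed through unchanged, the service state in Nagios stays the same; only the text is shortened.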
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NDO2DB Issue out of the blue

Post by BanditBBS »

rseiwert wrote:It's not (at least in my case) the size of the fields in the SQL table. The check results are passed into NDO2DB, and NDO uses IPC to pass them to its worker process. If the check output is larger than the buffer it allocates, it crashes. If you are running checkwmiplus: I modified checkwmiplus so it cannot overflow NDO. The problem is that the check causing the overflow is not seen or reported by Nagios, so you wouldn't know. In my case I was listing the application event log errors over the last hour. I had a mail server starting to throw crazy errors, which created a huge check result. NDO takes this and passes it along without checking for a buffer overrun. This of course corrupts the IPC queue. I created a check with a large text section to replicate it or, as someone found out, just running checkwmiplus in debug mode will replicate the issue. Debug mode on NDOUtils is also useless in troubleshooting this.
16000+ services being checked... If this is indeed the issue you described, I have no clue how I am going to find it.
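One possible way to narrow it down is to rank services by the size of their last output as recorded in the Nagios status file. A sketch only, assuming the usual Nagios Core `status.dat` layout (`servicestatus {` blocks with `host_name=`, `service_description=`, `plugin_output=` and `long_plugin_output=` lines); the function name `largest_outputs` is made up, and the file path in the usage line is the common source-install default, which may differ on your system. Note that a result big enough to crash NDO may never be written there, so this can only point at likely suspects.

```shell
# Rank services by combined length of plugin_output and
# long_plugin_output. Reads a Nagios status file on stdin and prints
# "size<TAB>host<TAB>service", largest first.
#   $1 = number of rows to show (defaults to 10)
largest_outputs() {
    awk '
        /^servicestatus [{]/ { in_svc = 1; host = ""; svc = ""; size = 0; next }
        in_svc && /^[ \t]*host_name=/ {
            sub(/^[ \t]*host_name=/, ""); host = $0; next
        }
        in_svc && /^[ \t]*service_description=/ {
            sub(/^[ \t]*service_description=/, ""); svc = $0; next
        }
        in_svc && /^[ \t]*(long_)?plugin_output=/ {
            sub(/^[ \t]*(long_)?plugin_output=/, ""); size += length($0); next
        }
        in_svc && /^[ \t]*[}]/ { print size "\t" host "\t" svc; in_svc = 0 }
    ' | sort -rn | head -n "${1:-10}"
}

# Usage (path is an assumption, adjust for your install):
#   largest_outputs 20 < /usr/local/nagios/var/status.dat
```

Checks that list log entries or event log errors, like the one described earlier in this thread, would be expected near the top.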