NDO2DB Issue out of the blue

Post by **Box293** » Wed Aug 19, 2015 11:20 pm

When running this:

ipcs -q

Do you see more than one nagios queue?

Try this:

service nagios stop
service ndo2db stop
for i in $(ipcs -q | awk '/nagios/ {print $2}'); do ipcrm -q "$i"; done
service ndo2db start
service nagios start
ipcs -q

Do you only see one queue?
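
The "do you only see one queue?" check can also be scripted. This is a minimal sketch, not part of the original advice: the helper name `count_queues` is made up for illustration, and it assumes the column layout that `ipcs -q` prints on Linux (key, msqid, owner, ...).

```shell
#!/bin/sh
# Count message queues owned by a given user (default: nagios).
# Reads `ipcs -q` style output on stdin so it can be tried offline.
# NOTE: count_queues is a hypothetical helper, not a Nagios tool.
count_queues() {
    owner="${1:-nagios}"
    # Queue rows start with a hex key (0x...); the owner is column 3.
    awk -v owner="$owner" '$1 ~ /^0x/ && $3 == owner { n++ } END { print n + 0 }'
}

# Usage on a live system:
#   ipcs -q | count_queues nagios
```

Anything other than 1 after the restart sequence suggests the extra queues are back.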

Post by **BanditBBS** » Wed Aug 19, 2015 11:30 pm

Yes, like 10 or so. Ran your commands and now there's just the one. It's taking forever for XI to get back to the ~16646 services count. Before I ran your commands the damn thing kept stopping at 7386 services and would just sit there and hang my NDO. Restarting NDO would get it to start moving again, but very slowly. I'm up to 15253 services, so almost there. Right now everything in XI shows scheduled but not run yet, even though it is running checks and the numbers in XI are updating.

Post by **Box293** » Wed Aug 19, 2015 11:41 pm

Yeah, I don't understand why it decides to create more queues. The other queues never seemed to get processed.

Post by **BanditBBS** » Thu Aug 20, 2015 8:13 am

At 4am it seemed to break. Our service desk contacted my Indian employee and he initiated an "Apply Changes" and that got everything working. I woke up this morning to everything working, however, look at this screenshot:

Capture.PNG

Why does XI think Nagios is not running? It clearly is, as you can see from the numbers shown in the checks performed area. Everything is working so well right now that I don't want to touch it until it breaks again, lol.

Post by **rseiwert** » Thu Aug 20, 2015 9:26 am

Sorry I didn't post sooner, but this seems very similar to something that was happening to me. In my case I had a check returning way too much data and it was crashing NDO2DB. There was nothing that pointed at the bad check either. The other thing that complicated the issue is that the system health checks and the init.d scripts are lazy. They keep the PID in a file, and if that process crashes and the PID is reused, Nagios thinks the service is up. What's even worse, if you do a restart on the service it will kill off some other critical service.
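
One way to guard against the reused-PID problem is to compare the command name behind the stored PID before trusting it. The following is only a sketch under assumptions: it relies on Linux's `/proc/<pid>/comm`, the function name `pidfile_matches` is invented for illustration, and the PID file path in the usage comment is a guess that should be checked against the `lock_file` setting in your own ndo2db.cfg.

```shell
#!/bin/sh
# Succeed only if the PID stored in a PID file is alive AND its command
# name matches what we expect; a crashed daemon whose PID was reused by
# another program fails the name comparison.
# NOTE: pidfile_matches is a hypothetical helper; Linux /proc is assumed.
pidfile_matches() {
    pidfile="$1"
    expected="$2"
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # Reject empty or non-numeric contents.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # /proc/<pid>/comm holds the command name of a running process.
    [ -r "/proc/$pid/comm" ] || return 1
    read -r actual < "/proc/$pid/comm"
    [ "$actual" = "$expected" ]
}

# Usage (path is an assumption, adjust to your install):
#   pidfile_matches /usr/local/nagios/var/ndo2db.lock ndo2db || echo "stale PID file"
```

Running a check like this before a restart avoids sending a kill to whatever unrelated process now holds that PID.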

https://support.nagios.com/forum/viewto ... 16&t=32516

https://support.nagios.com/forum/viewto ... 16&t=32290

https://support.nagios.com/forum/viewto ... 16&t=32206

The NDO Utils README states:
***************
!! IMPORTANT !!
***************
This code is still an alpha/beta quality, so expect problems if you intend to use
it. Make sure that you aren't using it with your only production installation of
Nagios, or it could take down the Nagios process if the NDOMOD module segfaults.
Nagios could segfault silently and you might never know that Nagios crashed...

Post by **tmcdonald** » Thu Aug 20, 2015 2:40 pm

I know we had some issues with large output before. @BanditBBS, did you ever modify the SQL tables to hold more than the 1024 (or whatever the default was) characters?

Post by **BanditBBS** » Thu Aug 20, 2015 2:47 pm

tmcdonald wrote:I know we had some issues with large output before. @BanditBBS, did you ever modify the SQL tables to hold more than the 1024 (or whatever the default was) characters?

Yeah, I do remember doing that, but that was for it displaying on the screen in XI. I never re-did that change when I migrated to the new server.

Update on issue: Things have been running fine now for 5.5 hours (a record for the past week). The change I made: disabled auto rescheduling. I'm waiting for a solid day of no issues and then I was going to declare it "fixed", but I would be concerned if that is causing issues yet again.

Post by **BanditBBS** » Thu Aug 20, 2015 2:56 pm

Ugh, as soon as I posted that, NDO broke:

[root@iss-chi-nag09 ~]# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xee070002 1409024    nagios     600        88145920     86080

Restarted NDO2DB and now I have this:

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xee070002 1409024    nagios     600        100672512    98313
0x50070002 1441793    nagios     600        0            0
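
The growing `messages` column in the listings above is the symptom to watch for. As a rough sketch (the helper name `report_backlog` and the default threshold are arbitrary choices for illustration, and the Linux `ipcs -q` column order is assumed), a backlog can be flagged like this:

```shell
#!/bin/sh
# Print "msqid messages" for every queue holding more messages than a
# threshold. Reads `ipcs -q` style output on stdin.
# NOTE: report_backlog is a hypothetical helper; the default threshold
# of 1000 is an arbitrary illustration, not a recommended value.
report_backlog() {
    threshold="${1:-1000}"
    # Column 2 is the queue id, column 6 the number of queued messages.
    awk -v t="$threshold" '$1 ~ /^0x/ && $6 + 0 > t + 0 { print $2, $6 }'
}

# Usage on a live system:
#   ipcs -q | report_backlog 1000
```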

Post by **rseiwert** » Thu Aug 20, 2015 3:20 pm

It's not (at least in my case) the size of the fields in the SQL table. The check results are passed into NDO2DB and NDO uses IPC to pass that to its worker process. If the check output is larger than the size of the buffer it allocates, it crashes. If you are running checkwmiplus, I modified checkwmiplus so it cannot overflow NDO. The problem is that the check that is causing the overflow is not being seen or reported by Nagios, so you wouldn't know. In my case I was listing the application event log errors over the last hour. I had a mail server starting to throw crazy errors, which created a huge check result. NDO takes this and then passes it on without checking for a buffer overrun. This of course corrupts the IPC queue. I created a check with a large text section to replicate it, or, as someone found out, just running checkwmiplus in debug mode will replicate the issue. Debug mode on NDOUtils is also useless in troubleshooting this.

Post by **BanditBBS** » Thu Aug 20, 2015 3:23 pm

rseiwert wrote:It's not (at least in my case) the size of the fields in the SQL table. The check results are passed into NDO2DB and NDO uses IPC to pass that to its worker process. If the check output is larger than the size of the buffer it allocates, it crashes. If you are running checkwmiplus, I modified checkwmiplus so it cannot overflow NDO. The problem is that the check that is causing the overflow is not being seen or reported by Nagios, so you wouldn't know. In my case I was listing the application event log errors over the last hour. I had a mail server starting to throw crazy errors, which created a huge check result. NDO takes this and then passes it on without checking for a buffer overrun. This of course corrupts the IPC queue. I created a check with a large text section to replicate it, or, as someone found out, just running checkwmiplus in debug mode will replicate the issue. Debug mode on NDOUtils is also useless in troubleshooting this.

16000+ services being checked...if this is indeed the issue you described, I have no clue how I am going to find it.
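
One possible way to narrow down a check with oversized output across thousands of services is to rank services by output size in the Nagios status file. This is a sketch under several assumptions, none confirmed in the thread: that the oversized output is still recorded in status.dat, that the file uses the usual `servicestatus { key=value }` block layout, and that it lives at the path shown in the usage comment. The function name `find_large_outputs` and the 4096 default are illustrative only.

```shell
#!/bin/sh
# List services whose combined plugin_output + long_plugin_output length
# exceeds a limit, largest first. Reads status.dat content on stdin.
# NOTE: find_large_outputs is a hypothetical helper; the default limit
# is an arbitrary illustration, not NDO's actual buffer size.
find_large_outputs() {
    limit="${1:-4096}"
    awk -v limit="$limit" '
        # Start of a service block: reset the per-service state.
        $1 == "servicestatus" && $2 == "{" {
            in_svc = 1; host = ""; svc = ""; size = 0; next
        }
        # End of a service block: report it if the output was too large.
        in_svc && $1 == "}" {
            if (size > limit + 0) printf "%d\t%s\t%s\n", size, host, svc
            in_svc = 0; next
        }
        # Inside a block: split each line at the first "=".
        in_svc {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "host_name") host = val
            else if (key == "service_description") svc = val
            else if (key == "plugin_output" || key == "long_plugin_output")
                size += length(val)
        }
    ' | sort -rn
}

# Usage (path is an assumption, adjust to your install):
#   find_large_outputs 4096 < /usr/local/nagios/var/status.dat
```

Each output line is the size, host name, and service description separated by tabs, so the top few entries are the first candidates to inspect.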
