Re: NDO2DB Issue out of the blue
Posted: Thu Aug 27, 2015 4:48 pm
by jfrickson
BanditBBS wrote:Ok, so this one may be looking like it is working. Going to make sure it all looks good for a few and then head home. Will let you know after the weekend if no issues arise or before that if it still happens
Whew! I was getting worried! Cross your fingers and hope for the best.
Re: NDO2DB Issue out of the blue
Posted: Fri Aug 28, 2015 8:13 am
by BanditBBS
Well, this was the first night all week that it hasn't crashed 2-3 times between 10pm and 8:10am. I have high hopes, but I'm not calling this completed/fixed until it goes the weekend with no issues as well... but looking good.

Re: NDO2DB Issue out of the blue
Posted: Fri Aug 28, 2015 9:00 am
by jfrickson
BanditBBS wrote:Well, this was the first night all week that it hasn't crashed 2-3 times between 10pm and 8:10am. I have high hopes, but I'm not calling this completed/fixed until it goes the weekend with no issues as well... but looking good.

WHOO-HOO!!

Re: NDO2DB Issue out of the blue
Posted: Fri Aug 28, 2015 12:05 pm
by rseiwert
Performed some testing on this patch and happy to report that it fixes the ipcs queues issue and Nagios stopping without any red flags but it does still have a corrupted memory issue which causes bogus check data.
Using check_wmi_plus's event log check, I normally have it set to alert on any errors in the last hour and then list those errors. To overflow NDO, I changed the settings to alert on any log entry in the last 100 hours and list them, which creates a larger-than-normal "additional" data section.
Before patching ndo2db.c, I made this change to the application event log check and verified that it did indeed cause Nagios to choke without throwing any red flags in the interface, and the ipcs queues to grow, along with all the other issues documented in my previous posts about this. This happened almost immediately.
I then set the application log check back to normal, reporting only errors in the application log for the last hour. Verified everything was working. Applied the patch, restarted the server and verified again everything worked normally. Good so far.
I then changed the settings to alert on any application log entry in the last 100 hours and list them as before. Nagios continued to perform its checks and the ipcs queues are at zero. Yay, no choking! Then I noticed the invalid data in the checks. I am seeing the results from other servers' random checks under the Application Log check. Under my Application Log check I'm seeing the results from SNMP checks, OpenManage checks, System OS checks, etc. In the picture below, almost every application log is going to have some entry in the last 100 hours and should be critical, but these are all green! The only check in this pic with a valid status from the Application Log check is sp7.
All in all, it's better than crashing, but events that should be coming up critical are showing another event's results and coming up OK. Since no one looks at OK checks, just warning and critical, no one would ever see them, and if they did they would probably think, that's curious, but hey, it says it's OK. Since checks with results that are larger than expected tend to be associated with things being completely borked, the true issue is still masked and being misreported. In my initial run-in with this, it was an Exchange server with bad corruption issues throwing almost 10 events a second. The event log check contained some other check's data and I never got the message that application errors had occurred in the last hour.
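For anyone who wants to watch the queues while reproducing this, here is a minimal Python sketch that parses `ipcs -q` output and totals the waiting messages. The function names and the sample text are hypothetical, and the column layout is assumed to be the usual Linux (util-linux) one; it is not taken from the poster's system.

```python
def parse_ipcs_queues(text):
    """Return a list of (msqid, used_bytes, messages) from `ipcs -q` output."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a hex key such as 0x4e010002.
        if len(fields) == 6 and fields[0].startswith("0x"):
            queues.append((int(fields[1]), int(fields[4]), int(fields[5])))
    return queues


def backlog(queues):
    """Total number of messages still waiting across all queues."""
    return sum(messages for _, _, messages in queues)


# Made-up sample in the assumed layout, not captured from a real system.
SAMPLE = """
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x4e010002 32768      nagios     600        0            0
"""

print(backlog(parse_ipcs_queues(SAMPLE)))
```

To run it against a live system, the text could come from `subprocess.run(["ipcs", "-q"], capture_output=True, text=True).stdout`. A backlog that keeps growing matches the pre-patch symptom described above; a backlog of zero says nothing about whether the stored check data is correct.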
Re: NDO2DB Issue out of the blue
Posted: Fri Aug 28, 2015 12:14 pm
by BanditBBS
rseiwert, that's so odd. Does that happen even with the unpatched ndo2db? I've had some checks that have returned a half ton of data and never saw that behavior.
Re: NDO2DB Issue out of the blue
Posted: Fri Aug 28, 2015 12:22 pm
by jfrickson
rseiwert wrote:Performed some testing on this patch and happy to report that it fixes the ipcs queues issue and Nagios stopping without any red flags but it does still have a corrupted memory issue which causes bogus check data.
Very odd. I'll look into it.
Question: On EX3 it gives an IP address. Is that EX3's IP or someone else's? If it's another machine, did that message get logged in the correct place as well?
Thanks!
Re: NDO2DB Issue out of the blue
Posted: Fri Aug 28, 2015 1:13 pm
by rseiwert
No, that is another machine's IP address. It's the bigger checks that cause the memory corruption. If I force an immediate check on one that is wrong, I get the results of some other check from another machine. If I force an immediate check yet again, I get the same thing, only different: the results for a different service from yet another machine. If I were to guess, I would assume it's from what was previously in the message queue, but that is just an educated guess. I had to laugh when I got an alert the toner was low on my domain controller.
Re: NDO2DB Issue out of the blue
Posted: Fri Aug 28, 2015 1:15 pm
by tmcdonald
rseiwert wrote:I had to laugh when I got an alert the toner was low on my domain controller.
After the conference this is going in my forum signature

Re: NDO2DB Issue out of the blue
Posted: Fri Aug 28, 2015 1:24 pm
by jfrickson
rseiwert wrote:I had to laugh when I got an alert the toner was low on my domain controller.
Nagios. We make monitoring fun!
Sorry, here's a patch that might fix it. No promises, though. It's pretty much a SWAG.
Re: NDO2DB Issue out of the blue
Posted: Fri Aug 28, 2015 4:13 pm
by rseiwert
Same issue with V3 of the patch when passing long results.
On my exchange server the application event log reports OK with "Linux version 2.6.32-573.3.1.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Aug 13 22:55:16 UTC 2015" and another server reports "HTTP OK: HTTP/1.1 301 Moved Permanently - 164 bytes in 0.006 second response time". As the check is currently configured these checks should be returning critical and a long list of events, which I do expect to be truncated, ideally to whatever the field size is in the database. When the check result is below 8K there is no issue. I agree check results should not be over 8K but that doesn't mean when something is borked that it doesn't happen and to report that borked service is OK when it's not is, well, not OK.
Bandit, I did see this behavior in the unpatched version as well, but Nagios XI was also choking in the exact same way you were experiencing. When it was borderline, XI would work past its choke state and then I would notice the invalid results. To test, I'm throwing 2 tons of data at the issue. It also might be because it's a check line, perf data, then extended data. I would never expect Nagios to display or keep what I'm throwing at it, but I said I could duplicate this problem and this is how I duplicate it. GIGO. For you, I do wonder: while your system is no longer crashing, what might be reporting OK when it's really down?
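To make the expected behavior concrete, here is a minimal Python sketch of truncating an oversized check result to a fixed size so it still carries its own status text. This is not how ndo2db.c handles it; the function name is hypothetical and the 8192-byte limit is an assumption based on the "8K" figure above.

```python
MAX_OUTPUT_BYTES = 8192  # assumption: the "8K" threshold mentioned in this thread


def truncate_output(output, limit=MAX_OUTPUT_BYTES):
    """Return output cut to at most `limit` bytes of UTF-8."""
    data = output.encode("utf-8")
    if len(data) <= limit:
        return output
    # Cut at the byte limit, then drop any partial multi-byte character
    # left at the end so the result is still valid text.
    return data[:limit].decode("utf-8", errors="ignore")


# Hypothetical oversized result: a status line followed by a long event list.
big = "CRITICAL: 500 events found\n" + "event text\n" * 2000
short = truncate_output(big)
print(len(short.encode("utf-8")), short.splitlines()[0])
```

The point of the sketch is only that a truncated result keeps the first line, so a critical check would still show as critical instead of showing another check's data.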