
Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 3:33 pm
by BanditBBS
I only did this on my dev box first. I'll probably do it on prod Wednesday and will be able to report back then with improvements and exactly how I did it.

Also, back to my opening post: any idea why my CCM log causes errors and why I can't really see much at all in it?

Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 4:27 pm
by slansing
What errors is it causing? Do you have an example? This is something I've/we've brought up, fleshing out the CCM log a bit to include more input on what a CCM user does while they are logged in.

Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 4:38 pm
by BanditBBS
slansing wrote:What errors is it causing? Do you have an example? This is something I've/we've brought up, fleshing out the CCM log a bit to include more input on what a CCM user does while they are logged in.
Well, if I click on the CCM Log link, it initially comes up blank. If I leave the search box blank and hit search, a few items come up, but I know it's 1% of the stuff I've done in the CCM.

Examples of the errors are in the first post. And welcome to this fun thread, Sam :)

Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 8:44 pm
by scottwilkerson
What is set if you go to?
Admin -> Performance Settings -> Databases Tab -> NagiosQL Database section -> Max Logbook Age

Only items less than this many minutes old are retained...

Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 8:52 pm
by BanditBBS
scottwilkerson wrote:What is set if you go to?
Admin -> Performance Settings -> Databases Tab -> NagiosQL Database section -> Max Logbook Age

Only items less than this many minutes old are retained...
480... so that explains why so low. So, now I'm OK with this. I still wouldn't mind fixing the errors in the http error log though.

Also, just one more related question: why isn't there a delay or some sort of check built into the apply config and startup scripts to make sure the one doesn't start before the other?

Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 9:16 pm
by scottwilkerson
BanditBBS wrote: 480... so that explains why so low. So, now I'm OK with this
You can change this to whatever amount you want...
BanditBBS wrote: I still wouldn't mind fixing the errors in the http error log though.
Patches already committed for next release... ;)
BanditBBS wrote:
scottwilkerson wrote:What is set if you go to?
Admin -> Performance Settings -> Databases Tab -> NagiosQL Database section -> Max Logbook Age

Only items less than this many minutes old are retained...
480... so that explains why so low. So, now I'm OK with this. I still wouldn't mind fixing the errors in the http error log though.

Also, just one more related question: why isn't there a delay or some sort of check built into the apply config and startup scripts to make sure the one doesn't start before the other?
Sorry, I'm not exactly sure which items you are referring to... Could you elaborate?

Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 9:22 pm
by BanditBBS
Scott, I am referring to this:
sreinhardt wrote:
it happened Sunday morning at the exact same time an Apply Configuration was processed. It also appeared to want to happen this morning, but I forced the processes to start and everything is fine.
That would be a symptom of the NDO/Nagios SQL race condition that abrist was talking about. Are your configs stored locally to the Nagios server on a ramdisk or hard disk? Generally this seems to be due to Nagios marking everything inactive while it imports the configs, then marking them active again in the DB once everything has been loaded into memory. If NDO connects before the import and re-activation have completed, it wreaks all sorts of havoc. Generally we see this with higher-latency systems between Nagios and MySQL, or if the Nagios configs are on a SAN/NAS that is performing poorly. Somehow I'm guessing this isn't normally the case for you, since it hasn't been mentioned before. This is usually noticed on an apply config or Nagios service restart.

Resolutions for some customers up to this point have been:
Reduce latency to the MySQL server from the Nagios server.
Move the Nagios configs to your ramdisk as well, ideally rsyncing them back to the actual HDD so that they are kept and do not need to be re-applied before they can be imported after a server reboot.
Move the Nagios configs local to the system if they are on a SAN/NAS (not likely your issue).

Other things that might impact it:
High load or disk I/O on the Nagios server when applying config.
Backups or other high traffic/disk activity on either MySQL or Nagios when the apply config happens.
Other abnormal network traffic around the time this happened.

I just wanted to get this info to you and see if it might make sense in your case. It certainly sounds like this is what is happening, as it really only seems to affect large installs with particular optimizations in place.
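
A rough sketch of the rsync step from the resolutions list above, assuming rsync is installed. The function name and both directory arguments are hypothetical placeholders, not paths Nagios XI ships with:

```shell
#!/bin/sh
# Hypothetical helper: copy the working configs from the ramdisk back to
# persistent storage so they survive a reboot.
# $1 = ramdisk config directory, $2 = on-disk copy (both are placeholders)
sync_configs_to_disk() {
    src=$1
    dst=$2
    # -a is archive mode (recursive, preserves permissions and timestamps).
    # --delete removes files from the copy that no longer exist on the ramdisk.
    # The trailing slash on "$src/" copies the directory contents,
    # not the directory itself.
    rsync -a --delete "$src/" "$dst/"
}
```

Something like this could be run from cron or after each Apply Configuration, with the two directories filled in for the local setup.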

Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 9:30 pm
by scottwilkerson
To be totally honest, I don't know that this is exactly how I would have diagnosed the problem. A race condition, possibly, but if I had to guess, it was because nagios couldn't exit in a timely manner, and the likely cause of that would be waiting for mod_gearman checks...

If I were you I would do the following to give nagios a bit more time to finish before starting again.

In /etc/init.d/nagios, around line 173, change

Code:

for i in 1 2 3 4 5 6 7 8 9 10 ; do
to

Code:

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ; do
This will give nagios up to 30 seconds to close cleanly before starting again.
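
The same wait can be written without spelling out every number. This is a minimal sketch, assuming the loop in the init script sleeps one second per pass while the old process is still alive. The function name wait_for_exit and its arguments are hypothetical and not taken from /etc/init.d/nagios:

```shell
#!/bin/sh
# Hypothetical helper, not part of the stock init script.
# Waits up to $2 seconds (default 30) for process $1 to exit.
# Returns 0 if the process is gone, 1 if it is still running at the deadline.
wait_for_exit() {
    pid=$1
    timeout=${2:-30}
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only checks that the PID exists
        if ! kill -0 "$pid" 2>/dev/null; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    # final check after the last sleep
    ! kill -0 "$pid" 2>/dev/null
}
```

With a helper like this, the timeout becomes a single number to tune rather than a longer list in the for statement.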

Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 9:40 pm
by BanditBBS
Scott,

I will make that change. However, I am leaning a bit more towards the race condition because the issue can happen on a server reboot as well as on an "Apply Configuration", so that would rule out the issue you mentioned. That said, I have seen instances that could have been the result of what you said, so I am gladly making those changes.

p.s. Why are we working at 9:30? Well, 8:30 your time....

Re: NagiosXI had a seizure

Posted: Mon Dec 16, 2013 9:42 pm
by scottwilkerson
BanditBBS wrote:p.s. Why are we working at 9:30? Well, 8:30 your time....
cleaning up some loose ends... It's going to be a long week.....