NagiosXI had a seizure

BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NagiosXI had a seizure

Post by BanditBBS »

I only did this on my dev box first. I'll probably do it on prod Wednesday and will be able to report back then with improvements and exactly how I did it.

Also, back to my opening post: any idea why my CCM log causes errors and why I can't really see much at all in it?
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: NagiosXI had a seizure

Post by slansing »

What errors is it causing? Do you have an example? This is something I've/we've brought up, fleshing out the CCM log a bit to include more input on what a CCM user does while they are logged in.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NagiosXI had a seizure

Post by BanditBBS »

slansing wrote:What errors is it causing? Do you have an example? This is something I've/we've brought up, fleshing out the CCM log a bit to include more input on what a CCM user does while they are logged in.
Well, if I click on the CCM Log link, it initially comes up blank, and if I leave the search box blank and hit search, a few items come up, but I know it's 1% of the stuff I've done in the CCM.

Examples of the errors are in the first post. And welcome to this fun thread, Sam :)
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: NagiosXI had a seizure

Post by scottwilkerson »

What is set if you go to the following?
Admin -> Performance Settings -> Databases Tab -> NagiosQL Database section -> Max Logbook Age

Only items less than this many minutes are retained...
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NagiosXI had a seizure

Post by BanditBBS »

scottwilkerson wrote:What is set if you go to?
Admin -> Performance Settings -> Databases Tab -> NagiosQL Database section -> Max Logbook Age

Only items less than this many minutes are retained...
480...so that explains why so low. So, now I'm OK with this; I still wouldn't mind fixing the errors in the HTTP error log, though.

Also, just one more related question: why isn't there a delay or some sort of check built into the apply config and startup scripts to make sure one doesn't start before the other?
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: NagiosXI had a seizure

Post by scottwilkerson »

BanditBBS wrote:480...so that explains why so low. So, now I'm OK with this
You can change this to whatever amount you want...
BanditBBS wrote:still wouldn't mind fixing the errors in the HTTP error log though.
Patches already committed for next release... ;)
BanditBBS wrote:Also, just one more related question: why isn't there a delay or some sort of check built into the apply config and startup scripts to make sure one doesn't start before the other?
Sorry, I'm not exactly sure which items you are referring to. Could you elaborate?
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NagiosXI had a seizure

Post by BanditBBS »

Scott, I am referring to this:
sreinhardt wrote:
it happened Sunday morning at the exact same time an Apply Configuration was processed. It also appeared to want to happen this morning, but I forced the processes to start and everything is fine.
That would be a symptom of the NDO/Nagios SQL race condition that abrist was talking about. Are your configs stored locally on the Nagios server, on a ramdisk or hard disk? Generally this seems to be due to Nagios marking everything inactive while it imports the configs, then marking them active again in the DB once everything has been loaded into memory. If NDO connects before the import and re-activation have completed, it wreaks all sorts of havoc. Generally we see this on systems with higher latency between Nagios and MySQL, or when the Nagios configs are on a SAN/NAS that is performing poorly. I'm guessing this isn't normally the case for you, since it hasn't been mentioned before. This is usually noticed on an apply config or a Nagios service restart.

Resolutions for some customers up to this point have been:
Reduce latency to the MySQL server from the Nagios server.
Move the Nagios configs to your ramdisk as well, ideally rsyncing them back to the actual HDD so that they are kept and do not need to be applied again before they can be imported after a server reboot.
Move the Nagios configs local to the system if they are on a SAN/NAS (not likely your issue).

Other things that might impact it:
High load or disk I/O on the Nagios server when applying config
Backups or other high traffic/disk activity on either MySQL or Nagios when the apply config happens
Other abnormal network traffic at the time this happened

I just wanted to get this info to you and see if it might make sense in your case. It certainly sounds like this is what is happening, as it really only seems to affect large installs with particular optimizations in place.
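
As a rough illustration of the kind of "delay or check" asked about earlier in the thread, a startup script could poll a readiness probe before starting the dependent service. This is only a sketch: `retry_until` is a hypothetical helper, not something shipped with Nagios XI, and the probe you pass to it is up to you. Note that a probe such as `mysqladmin ping` only tells you MySQL is answering; it does not tell you that Nagios has finished importing and re-activating its objects.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios XI): run a probe command once per
# second until it succeeds, giving up after a fixed number of attempts.
# Usage: retry_until <attempts> <command> [args...]
retry_until() {
    attempts="$1"
    shift
    n=0
    while [ "$n" -lt "$attempts" ]; do
        # The probe's output is discarded; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        n=$((n + 1))
        sleep 1
    done
    return 1
}

# Example (assumption: MySQL credentials are already configured for the
# calling user, e.g. in ~/.my.cnf):
#   retry_until 30 mysqladmin ping || echo "MySQL still not answering" >&2
```

The function returns 0 as soon as the probe succeeds and 1 if every attempt fails, so the caller decides whether to start the next service or abort.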
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: NagiosXI had a seizure

Post by scottwilkerson »

To be totally honest, I don't know that this is exactly how I would have diagnosed the problem. A race condition, possibly, but if I had to guess, it was because Nagios couldn't exit in a timely manner, and the most likely cause of that would be waiting for mod_gearman checks...

If I were you, I would do the following to give Nagios a bit more time to finish shutting down before it starts again.

In /etc/init.d/nagios, around line 173, change

for i in 1 2 3 4 5 6 7 8 9 10 ; do
to

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ; do
This will give nagios up to 30 seconds to close cleanly before starting again.
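
For anyone wondering what that loop is doing, here is a minimal sketch of the same idea: poll once per second until the old process is gone, up to a limit. This is not the stock init script; `wait_for_exit` is a hypothetical helper, and using `kill -0` as the liveness check is an assumption about how you would test for the process.

```shell
#!/bin/sh
# Hypothetical helper (not taken from /etc/init.d/nagios): wait up to
# <timeout> seconds for a PID to disappear.
# Returns 0 if the process is gone, 1 if it is still there at the deadline.
wait_for_exit() {
    pid="$1"
    timeout="${2:-30}"
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only tests whether the PID still exists.
        if ! kill -0 "$pid" 2>/dev/null; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}
```

If you would rather not type out thirty numbers in the init script, `for i in $(seq 1 30) ; do` iterates over the same values.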
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: NagiosXI had a seizure

Post by BanditBBS »

Scott,

I will make that change. However, I am leaning a bit more towards the race condition, because the issue can happen on a server reboot as well as on an "Apply Configuration", which would rule out the issue you mentioned. That said, I have seen instances that could have been the result of what you described, so I am gladly making those changes.

p.s. Why are we working at 9:30? Well, 8:30 your time....
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: NagiosXI had a seizure

Post by scottwilkerson »

BanditBBS wrote:p.s. Why are we working at 9:30? Well, 8:30 your time....
Cleaning up some loose ends... It's going to be a long week.