
Long apply configurations

Posted: Thu Sep 29, 2016 2:11 am
by WillemDH
Hello,

Our apply configurations are taking longer and longer. It can take up to 40 seconds before Nagios XI is back online. Looking around the web, I see competing Nagios clones which have implemented a system where the parent process keeps handling the monitoring while the new configuration is loaded in a duplicate child process. When the new configuration is completely loaded, the parent process with the old configuration is killed and the new process takes over, resulting in a supposed 'downtime' of only 3-5 seconds.
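The spawn-and-swap pattern described above might be sketched roughly like this. This is purely an illustration, not actual Nagios (or clone) code; the `fake_daemon` function is a stand-in I made up for a monitoring process, and the `sleep 1` stands in for "wait until the new configuration is fully loaded":

```shell
# Illustrative sketch (not real Nagios code) of a spawn-and-swap reload:
# the old process keeps serving while a replacement loads the new
# configuration; only then is the old one stopped.

fake_daemon() {
    # Stand-in for a monitoring daemon; a real one would poll hosts/services.
    sleep 60
}

fake_daemon & old_pid=$!      # "old" daemon, running the old configuration

fake_daemon & new_pid=$!      # "new" daemon starts loading the new config
sleep 1                       # stand-in for "wait until new config is loaded"

kill "$old_pid"               # hand over: stop the old daemon only now
wait "$old_pid" 2>/dev/null   # reap it
echo "swap complete, new daemon pid: $new_pid"

kill "$new_pid"; wait "$new_pid" 2>/dev/null   # cleanup for this demo only
```

The point of the design is that the monitoring gap shrinks to just the final handover, rather than covering the whole configuration load.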

Is this a feature that could be implemented in Nagios XI? Honestly, the long apply configurations are one of the most annoying aspects of Nagios XI. During the apply configuration process, there is a window of 15-20 seconds where the Nagios hosts and services are no longer visible. Then there is a window of 10 seconds where hosts and services that were in downtime or acknowledged show up in the open service problems views. This results in very confusing situations, with duplicate calls and frustrated colleagues.

I understand my Nagios XI instance is bigger than average, but we really need a better and more consistent solution for the apply configuration process. Please realize that about 10-20 applies are done each day, resulting in 10-20 windows of 40 seconds where our views and dashboards are flashing, showing nothing at all, or showing problems that have already been acknowledged.

Thanks for looking into this.

Willem

Re: Long apply configurations

Posted: Thu Sep 29, 2016 12:04 pm
by tmcdonald
This would need to be more of a Core change than an XI one, but I think XI would need to be involved at some point as well, just not to the same degree.

I pinged our Core dev for his thoughts on this and will update the thread when I know more. That being said, I think this sort of functionality would be a great idea.

Re: Long apply configurations

Posted: Thu Sep 29, 2016 12:17 pm
by tmcdonald
From our dev:

    That might work. It would need some careful coding, but that might be the easiest way to do it.

A GitHub issue was suggested, and I can file that or you can, it doesn't matter to me.

Bear in mind this would take a lot of re-architecting and testing, so it likely would not be done very soon.

Re: Long apply configurations

Posted: Thu Sep 29, 2016 1:12 pm
by WillemDH
Trevor,

I understand this would take time to implement. I'll make the GitHub issue.

https://github.com/NagiosEnterprises/na ... issues/176

Thanks

Willem

Re: Long apply configurations

Posted: Thu Sep 29, 2016 4:06 pm
by rkennedy
Thanks Willem! I'll leave this thread open should further discussion happen in the future, or if you have anything to add.

Re: Long apply configurations

Posted: Fri Dec 02, 2016 10:14 am
by WillemDH
As requested by avandemore (https://support.nagios.com/forum/viewto ... 20#p204520):

    During these restarts, does the information show up in Core?

Yes, the issue is also present in Core. I just tested it.

Grtz

Willem

Re: Long apply configurations

Posted: Fri Dec 02, 2016 1:08 pm
by avandemore
If Core is exhibiting this behavior, then this is different from the referenced thread. Please post or PM your nagios.cfg.

Re: Long apply configurations

Posted: Sat Dec 03, 2016 3:12 pm
by WillemDH
PM'd you my config.

Re: Long apply configurations

Posted: Mon Dec 05, 2016 11:39 am
by avandemore
Your configuration looks correct for Core to preserve state across a restart. During an Apply Config, what is the output from:

Code: Select all

# tail -F /usr/local/nagios/var/retention.dat

You can also PM this if necessary.
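If watching the stream live is unwieldy, one option (a sketch of my own, not an official procedure; the log path and the 5-second window are placeholders) is to capture the churn into a file for the duration of the apply:

```shell
# Sketch: record retention.dat churn during an Apply Config into a file
# instead of watching it scroll by live. Adjust the sleep to cover the
# real apply window (~40s in this thread).
log=/tmp/retention-during-apply.log
tail -F /usr/local/nagios/var/retention.dat > "$log" 2>/dev/null &
tail_pid=$!

sleep 5                      # placeholder for the apply window

kill "$tail_pid"
wait "$tail_pid" 2>/dev/null
wc -l "$log"                 # how much churn the apply produced
```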

Re: Long apply configurations

Posted: Wed Dec 07, 2016 9:47 am
by WillemDH
Avandemore,

Well... I did as you asked. The output from the tail is huge; it will be hard to even PM you all of it. Basically, from the moment I start the apply there is no output for 11 seconds, and then it starts outputting like crazy for roughly 20 more seconds.

Example output:

Code: Select all

hostdowntime {
host_name=servername
comment_id=2404807
downtime_id=374057
entry_time=1480831262
start_time=1481349600
flex_downtime_start=0
end_time=1481367600
triggered_by=0
fixed=1
duration=18000
is_in_effect=0
start_notification_sent=0
author=user
comment=AUTO: alfresco rebuild index
}

It's just too much content and full of sensitive information. If you absolutely want to see this data, I suggest we do a remote support session or something similar.
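If sharing some of it ever becomes necessary, one option (purely a sketch, not official Nagios advice; the sample file and field list are my own) would be to redact the sensitive fields first:

```shell
# Sketch: strip the sensitive fields (host_name, author, comment) from a
# retention.dat excerpt before sharing it. The sample below mirrors the
# hostdowntime block posted above, with placeholder values.
cat > /tmp/retention-sample.dat <<'EOF'
hostdowntime {
host_name=servername
author=user
comment=AUTO: alfresco rebuild index
}
EOF

# Note: '^comment=' will not touch comment_id= lines, only comment= lines.
sed -e 's/^host_name=.*/host_name=REDACTED/' \
    -e 's/^author=.*/author=REDACTED/' \
    -e 's/^comment=.*/comment=REDACTED/' \
    /tmp/retention-sample.dat > /tmp/retention-redacted.dat

grep REDACTED /tmp/retention-redacted.dat
```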

Grtz

Willem