Thank you for your diligence and corroboration on how nagios should run (as long as the user is doing what they are saying they are doing).
I did find the issue and, I believe, you asked me about it earlier in this thread: Nagios was _not_ being restarted. Somehow, or because of someone, the pid file (in our case: NagiosRunFile=/var/nagios/nagios.pid) no longer existed.
I had been using 'restart' in our scripts and also tried 'reload' and 'force-reload' in my debugging efforts.
I've checked the init.d startup file (at least the one generated from an RPM-built nagios) and see why it does not restart when the pid file is absent. The same issue affects reload and force-reload.
I will be building some defenses against such potential issues. FWIW, I'll be doing a stop and then a start. If I don't get a good result from stop, I'll first rebuild the pid file from a pgrep of the controller process (this one: nagios 1410 1 0 13:00 ? 00:00:23 /usr/bin/nagios -d /etc/nagios/nagios.cfg) and then call stop again. Should that not give joy, I'll do an xargs -n1 kill -15 and then start nagios.
I'm avoiding kill -9 for obvious reasons, and -15 works fine. BTW, I'm running this under an HA configuration that's a bit different in that it tries really hard to keep nagios on the original master node. Along with that, and the other things we do to remove/add nodes to the nagios config, we might be creating our own problem if we end up losing the pid file from 'too many heads in the soup'. Ordinarily I like to keep the HA controls to 1 head.
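The stop / rebuild pid file / stop again / kill -15 / start sequence described above could be sketched roughly as follows. This is only an illustration under assumptions: the init script path `/etc/init.d/nagios`, the idea that a "good result" means exit status 0, and the helper names `find_controller_pid` and `safe_restart` are all hypothetical and not taken from any Nagios tooling.

```python
import os
import signal
import subprocess

PID_FILE = "/var/nagios/nagios.pid"      # NagiosRunFile value quoted in the post
INIT_SCRIPT = "/etc/init.d/nagios"       # assumed location of the init script
DAEMON_PATTERN = "^/usr/bin/nagios -d"   # matches the command line quoted in the post


def find_controller_pid(pattern=DAEMON_PATTERN):
    """Return the PID of the controller process (parent PID 1), or None."""
    # pgrep -P 1 limits matches to children of PID 1; -f matches the full command line.
    result = subprocess.run(["pgrep", "-P", "1", "-f", pattern],
                            capture_output=True, text=True)
    pids = [int(p) for p in result.stdout.split()]
    return pids[0] if pids else None


def safe_restart(run_init=None, find_pid=find_controller_pid,
                 pid_file=PID_FILE, kill=os.kill):
    """Stop then start; repair the pid file or send SIGTERM if stop fails."""
    if run_init is None:
        # Default: call the init script and report its exit status.
        run_init = lambda action: subprocess.run([INIT_SCRIPT, action]).returncode

    if run_init("stop") != 0:            # assumption: non-zero means stop failed
        pid = find_pid()
        if pid is not None:
            # Rebuild the missing pid file so the init script can find the daemon.
            with open(pid_file, "w") as fh:
                fh.write(f"{pid}\n")
            if run_init("stop") != 0:
                # Last resort: SIGTERM (kill -15), never SIGKILL (kill -9).
                kill(pid, signal.SIGTERM)
    return run_init("start") == 0
```

A real version would probably also wait for the old process to exit after the SIGTERM before calling start, and would need to confirm what exit status the init script actually returns when the pid file is missing.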
Would you suggest using reload instead of restart when it is config file changes alone that we are making (and we have our own built in 'sanity check', essentially running the 'pre-flight' check after any config file changes prior to restarting nagios)?
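A check-then-reload flow of the kind asked about here might look roughly like the sketch below. It assumes the binary and config paths quoted earlier in this post and an init script at `/etc/init.d/nagios` (an assumption); `nagios -v <config>` is the standard pre-flight verification, while `verify_then_reload` is a hypothetical helper name.

```python
import subprocess

NAGIOS_BIN = "/usr/bin/nagios"          # path quoted in the post
NAGIOS_CFG = "/etc/nagios/nagios.cfg"   # path quoted in the post
INIT_SCRIPT = "/etc/init.d/nagios"      # assumed location of the init script


def verify_then_reload(run=subprocess.run):
    """Run the pre-flight check; only ask for a reload if it passes."""
    # nagios -v exits non-zero when the configuration has errors.
    check = run([NAGIOS_BIN, "-v", NAGIOS_CFG])
    if check.returncode != 0:
        return False
    # The reload action in the init script still depends on the pid file existing.
    return run([INIT_SCRIPT, "reload"]).returncode == 0
```

Note that this does nothing about a missing pid file, so it only helps once that problem is guarded against separately.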
Again, thanks very much for your time and invaluable assistance, jdalrymple. I apologize for not confirming the 'restart' behavior after you mentioned that potential issue in a prior post in this topic.
The changes we were making "did work before and did not work now and we changed nothing". Always a head scratcher.
remove host from hostgroup but it still gets service checks
-
tredlightly
- Posts: 8
- Joined: Thu Feb 19, 2015 3:34 pm
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: remove host from hostgroup but it still gets service che
I don't need to tell you that our init script isn't very intelligent. It does NOT handle situations very well where the nagios process is interfered with outside of the scope of that script since it relies so heavily on getting/setting its own variables and the process itself doesn't handle this.
My recommendation - if you're using a very complex environment (which yours does sound like) I'd encourage you to patch the source and make the program manage its own PID file. Then, when the init script goes back to reference it, you can rest assured that it's getting current data. Otherwise there is very little practical difference between HUP (reload) and TERM (restart). You could add more intelligence to the init script, but my opinion is that it is already bloated and has potential to break (as you've seen).
It will no doubt get a major overhaul very soon courtesy of all the Linuxes moving to systemd. For now, your best bet is to either move the intelligence of the script outside of the script, or find a way that your HA environment can better keep those files that the init script relies on up to date.
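As one illustration of keeping those files up to date from outside the init script, an HA agent or cron job could periodically verify the PID file and rewrite it when it is missing or stale. This is a minimal sketch, not anything shipped with Nagios: the helper names `pid_file_is_current` and `ensure_pid_file` are hypothetical, and the caller is assumed to supply its own way of finding the daemon's PID (for example via pgrep).

```python
import os


def pid_file_is_current(pid_file):
    """True if pid_file exists and names a process that is currently alive."""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return False                      # missing, unreadable, or garbage content
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)                   # signal 0: existence check, sends nothing
    except ProcessLookupError:
        return False                      # stale: no such process
    except PermissionError:
        return True                       # process exists but belongs to another user
    return True


def ensure_pid_file(pid_file, find_pid):
    """Rewrite pid_file from find_pid() when it is missing or stale."""
    if pid_file_is_current(pid_file):
        return True
    pid = find_pid()
    if pid is None:
        return False                      # daemon not running; nothing to record
    tmp = f"{pid_file}.tmp"
    with open(tmp, "w") as fh:
        fh.write(f"{pid}\n")
    os.replace(tmp, pid_file)             # atomic rename, so readers never see a partial file
    return True
```

The liveness check only proves that some process has that PID, not that it is nagios, so a stricter version would also compare the process's command line.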
Glad you got it resolved nonetheless!
-
tredlightly
- Posts: 8
- Joined: Thu Feb 19, 2015 3:34 pm
Re: remove host from hostgroup but it still gets service che
Thanks for the advice. I will put it to good use.
Re: remove host from hostgroup but it still gets service che
Mind if we close this thread up?
Former Nagios employee
-
tredlightly
- Posts: 8
- Joined: Thu Feb 19, 2015 3:34 pm
Re: remove host from hostgroup but it still gets service che
Not at all. Please feel free to do so. Thanks again, very much appreciated.