Changed Host monitoring int, checking old and new intervals
Posted: Tue Jan 17, 2017 9:11 am
by peter.zanetti
We monitor roughly 13k devices across three Nagios servers. We recently changed how these hosts are monitored. They used to be checked every 15 minutes and would alert after 5 checks (after 1 hour of being down). We changed them to be checked every hour and to alert after 25 checks (after 1 day of being down). I made the change by updating the config file for each host with a script run from the command line.
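As a quick sanity check on those numbers, here is a small sketch of the arithmetic. It assumes the first failed check counts as attempt 1 and that every retry runs at the same interval as the regular check, which is how the figures above work out. The function name minutes_to_alert is made up for illustration.

```shell
# Minutes from the first failed check until the final attempt that alerts.
# Assumption: attempt 1 is the first failure, and retries use the same interval.
# Arguments: check interval in minutes, max check attempts.
minutes_to_alert() {
    interval=$1
    attempts=$2
    echo $(( (attempts - 1) * interval ))
}

minutes_to_alert 15 5     # old settings: prints 60
minutes_to_alert 60 25    # new settings: prints 1440 (24 hours)
```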
As you can see below, I have a host with an "X of 25" counter that is checked every hour, but it also has an "X of 5" counter that is checked every 15 minutes. When it hits 5 of 5, an alert is created, which is not supposed to happen:
Capture.PNG
You can also see the host is showing to be configured with the 25 max check attempts:
Capture 2.PNG
Has anyone ever seen something like this? If so, how did you fix it?
Re: Changed Host monitoring int, checking old and new interv
Posted: Tue Jan 17, 2017 10:33 am
by rkennedy
Generally, I have seen this when multiple Nagios processes are running at the same time.
If you run
ps -ef | grep nagios.cfg | grep -v grep
you should (on a normal system) see two processes running:
nagios 1734 1 0 2016 ? 00:05:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 2271 1734 0 2016 ? 00:01:36 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Notice the second one is spawned from the initial parent process. If you run it on your system, I suspect you'll see multiple results. The general way to fix this is to kill off all current Nagios processes and then start it fresh.
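If you want to check this across several servers without eyeballing the output, something along these lines might help. It is only a sketch: it assumes the `ps -ef` column layout shown above (PPID in the third column) and that a top-level Nagios daemon has a PPID of 1, which may not hold on every system. The function name count_nagios_parents is made up for illustration.

```shell
# Count top-level Nagios daemons in `ps -ef`-style input read from stdin.
# Assumption: a top-level daemon is a nagios.cfg process whose PPID (column 3) is 1.
count_nagios_parents() {
    awk '$3 == 1 && /nagios\.cfg/ { n++ } END { print n + 0 }'
}

# Demo using the two sample lines from this post: one parent, one child.
count_nagios_parents <<'EOF'
nagios 1734 1 0 2016 ? 00:05:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 2271 1734 0 2016 ? 00:01:36 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
EOF
# prints 1

# On a live system:
#   ps -ef | count_nagios_parents
```

Anything greater than 1 would suggest more than one Nagios instance is scheduling checks.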
Re: Changed Host monitoring int, checking old and new interv
Posted: Tue Jan 17, 2017 10:47 am
by peter.zanetti
Just as you expected, across all three instances:
Capture 3.PNG
Capture 4.PNG
Capture 5.PNG
Would a reboot of the server fix this issue? If not, how do I kill these processes? And then, to start it fresh, is that just a
Re: Changed Host monitoring int, checking old and new interv
Posted: Tue Jan 17, 2017 1:28 pm
by rkennedy
Nice catch. You should be able to run pkill nagios and then service nagios start. This will kill off all of the Nagios processes and then start just the initial service.
Once it's down to one, things should function as expected.
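Roughly, the sequence could be wrapped up like this. Treat it as a sketch rather than a tested procedure: it assumes a SysV-style `service nagios` init script, that `pkill` and `pgrep` from procps are installed, and that it is run as root. The function name restart_nagios_clean and the 5 second pause are arbitrary choices for illustration.

```shell
# Stop Nagios, clear out leftover daemons, and start a single fresh instance.
# Refuses to start if any nagios.cfg process is still present afterwards.
restart_nagios_clean() {
    service nagios stop
    pkill nagios || true      # SIGTERM to leftovers; status 1 just means nothing matched
    sleep 5                   # give the processes a moment to exit
    if pgrep -f nagios.cfg >/dev/null; then
        echo "Nagios processes still running; not starting a new one" >&2
        return 1
    fi
    service nagios start
}

# Usage (as root):
#   restart_nagios_clean
```

Checking with pgrep before starting avoids stacking a new instance on top of one that ignored the pkill.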
Re: Changed Host monitoring int, checking old and new interv
Posted: Tue Jan 17, 2017 2:20 pm
by peter.zanetti
That seems to have worked well on one server:
Capture 8.PNG
But not so well on the other two.
For instance, on this one, Nagios process 11785 will not go away. I can run 'pkill nagios' and then 'ps -ef | grep nagios.cfg | grep -v grep', and the process will still be there:
Capture 6.PNG
And on this server it's not taking care of any of the other processes:
Capture 7.PNG
Any ideas on how to kill these?
Re: Changed Host monitoring int, checking old and new interv
Posted: Tue Jan 17, 2017 2:53 pm
by peter.zanetti
Never mind, I figured it out. I had to run 'kill -9 pid' to kill each of those stubborn processes individually. All three servers seem to be back to normal. I will keep an eye on our monitoring over the next few days to make sure this fixed the problem.
Thank you for all the help.
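For anyone who lands here later, below is a rough sketch of trying a normal kill first and only falling back to kill -9 when the process is still there after a grace period. kill -9 (SIGKILL) cannot be caught, so the process gets no chance to clean up, which is why it is kept as the last step. The function name terminate_pid and the default 10 second grace period are made up for illustration.

```shell
# Ask a process to exit with SIGTERM, then fall back to SIGKILL if it is
# still present after a grace period.
# Arguments: PID, grace period in seconds (default 10).
terminate_pid() {
    pid=$1
    grace=${2:-10}
    # If the signal cannot be sent, the process is already gone (or not ours to signal).
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        grace=$((grace - 1))
    done
    kill -KILL "$pid" 2>/dev/null                # same as kill -9
    return 0
}

# Usage, with the stubborn PID mentioned earlier in this thread:
#   terminate_pid 11785 5
```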
Re: Changed Host monitoring int, checking old and new interv
Posted: Tue Jan 17, 2017 3:26 pm
by dwhitfield
Do you want us to leave this open as you monitor or do you want us to lock it up?
Re: Changed Host monitoring int, checking old and new interv
Posted: Wed Jan 18, 2017 9:23 am
by peter.zanetti
Let's leave it open for now, just in case.
Re: Changed Host monitoring int, checking old and new interv
Posted: Wed Jan 18, 2017 9:49 am
by dwhitfield
Sounds good.