We monitor roughly 13k devices across three Nagios servers. We recently changed how these hosts are monitored. They used to be checked every 15 minutes and would alert after 5 checks (after 1 hour of being down). We changed them to be checked every hour and to alert after 25 checks (after 1 day of being down). I made the change by updating each host's config file with a script from the command line.
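For reference, the timing arithmetic behind those two settings can be sketched in shell. This is only an illustration: minutes_until_alert is a made-up helper name, and it assumes every recheck runs at the same interval as the normal check (in Nagios the rechecks after a first failure are governed by retry_interval, which may differ on your hosts).

```shell
# Hypothetical helper: minutes from the first failed check to the alert,
# assuming all rechecks are spaced at the same interval (an assumption).
# The first failed check happens at minute 0, so N attempts span N-1 intervals.
minutes_until_alert() {
  interval_min=$1
  max_attempts=$2
  echo $(( interval_min * (max_attempts - 1) ))
}

minutes_until_alert 15 5     # old settings: prints 60 (1 hour)
minutes_until_alert 60 25    # new settings: prints 1440 (1 day)
```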
As you can see below, I have a host that is being checked every hour with a counter of "X of 25", but it also has a counter of "X of 5" that is being checked every 15 minutes. When it hits 5 of 5, an alert is created, which is not supposed to happen:
You can also see that the host shows as configured with 25 max check attempts:
Has anyone ever seen something like this? If so, how did you fix it?
Changed Host monitoring int, checking old and new intervals
-
peter.zanetti
- Posts: 90
- Joined: Wed Oct 01, 2014 8:34 am
Re: Changed Host monitoring int, checking old and new interv
Generally, I have seen this caused by multiple Nagios processes running at the same time.
If you run ps -ef | grep nagios.cfg | grep -v grep you should (on a normal system) see two processes running:
Code: Select all
nagios 1734 1 0 2016 ? 00:05:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 2271 1734 0 2016 ? 00:01:36 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Notice the second one is spawned from the initial parent process (its parent PID, 1734, is the PID of the first). If you run it on your system, I suspect you'll see more than two results. The general way to fix this is to kill off all current Nagios processes and then start Nagios fresh.
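If eyeballing the ps output across several servers gets tedious, something along these lines could count the independent Nagios daemons for you. It is a rough sketch, not an official tool: count_nagios_roots is a made-up name, and it assumes the ps -ef column layout shown above (PID in column 2, parent PID in column 3) and that a healthy daemon's worker is a child of that daemon.

```shell
# Hypothetical helper: reads `ps -ef` style lines on stdin and prints how many
# Nagios processes have a parent that is NOT itself a Nagios process.
# A healthy system should report 1 (one daemon plus its child worker).
count_nagios_roots() {
  awk '/nagios\.cfg/ && !/grep/ { pid[$2] = 1; ppid[$2] = $3 }
       END {
         n = 0
         for (p in ppid) if (!(ppid[p] in pid)) n++
         print n
       }'
}

# Demo using the two lines from the healthy output above.
count_nagios_roots <<'EOF'
nagios 1734 1 0 2016 ? 00:05:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 2271 1734 0 2016 ? 00:01:36 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
EOF
# prints 1
```

On a live server you would feed it real output with ps -ef | count_nagios_roots. Anything above 1 suggests more than one daemon is scheduling checks.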
Former Nagios Employee
-
peter.zanetti
- Posts: 90
- Joined: Wed Oct 01, 2014 8:34 am
Re: Changed Host monitoring int, checking old and new interv
Just as you expected, across all three instances:
Would a reboot of the server fix this issue? If not, how do I kill these processes? And to start it fresh, is that just the following?
Code: Select all
service nagios restart
Re: Changed Host monitoring int, checking old and new interv
Nice catch. You should be able to run pkill nagios and then service nagios start. This kills all of the running Nagios processes and then starts just the initial service.
Once it's down to one, things should function as expected.
Former Nagios Employee
-
peter.zanetti
- Posts: 90
- Joined: Wed Oct 01, 2014 8:34 am
Re: Changed Host monitoring int, checking old and new interv
That seems to have worked well on one server:
But not so well on the other two.
For instance, on this one, Nagios process 11785 will not go away. I can run 'pkill nagios' and then 'ps -ef | grep nagios.cfg | grep -v grep' and the process will still be there:
And on this server, pkill is not taking care of any of the other processes:
Any ideas on how to kill these?
-
peter.zanetti
- Posts: 90
- Joined: Wed Oct 01, 2014 8:34 am
Re: Changed Host monitoring int, checking old and new interv
Never mind, I figured it out. I had to run 'kill -9 pid' to kill each of those stubborn processes individually. All three servers seem to be back to normal. I will keep an eye on our monitoring over the next few days to make sure this fixed the problem.
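In case it helps anyone else, here is roughly how that could be scripted instead of killing each PID by hand. Treat it as a sketch: terminate_pids and the 5-second grace period are my own illustrative choices, not anything Nagios ships. It sends a normal TERM first and only falls back to kill -9 for PIDs that are still around afterwards.

```shell
# Hypothetical helper: politely stop the given PIDs, then force-kill stragglers.
terminate_pids() {
  grace=5   # seconds to wait after TERM before using KILL (arbitrary choice)

  # Ask every process to exit normally first.
  for pid in "$@"; do
    kill -TERM "$pid" 2>/dev/null
  done

  # Poll once per second until all are gone or the grace period runs out.
  while [ "$grace" -gt 0 ]; do
    alive=0
    for pid in "$@"; do
      kill -0 "$pid" 2>/dev/null && alive=1
    done
    [ "$alive" -eq 0 ] && return 0
    sleep 1
    grace=$((grace - 1))
  done

  # Anything still running gets the equivalent of `kill -9 pid`.
  for pid in "$@"; do
    kill -0 "$pid" 2>/dev/null && kill -KILL "$pid" 2>/dev/null
  done
  return 0
}
```

It would be called with the leftover PIDs, for example terminate_pids $(pgrep -f nagios.cfg), followed by service nagios start. Note that pgrep -f matches anything with nagios.cfg on its command line, so check the list before running it.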
Thank you for all the help.
-
dwhitfield
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
Re: Changed Host monitoring int, checking old and new interv
Do you want us to leave this open as you monitor or do you want us to lock it up?
-
peter.zanetti
- Posts: 90
- Joined: Wed Oct 01, 2014 8:34 am
Re: Changed Host monitoring int, checking old and new interv
Let's leave it open for now, just in case.
-
dwhitfield
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN