Nagios Support Forum

Posted: **Tue Jan 17, 2017 9:11 am**

We monitor roughly 13k devices across three Nagios servers. We recently changed how these hosts are monitored. They used to be checked every 15 minutes and would alert after 5 checks (alert after 1 hour of being down). We changed them to be checked every hour and to alert after 25 checks (alert after 1 day of being down). I made the change by updating the config file for each host with a script run from the command line.

As you can see below, I have a host that is checked every hour with a counter of "X of 25", but it also has a counter of "X of 5" that is checked every 15 minutes. When that second counter hits 5 of 5, an alert is created even though it is not supposed to be:

Capture.PNG

You can also see that the host shows as configured with 25 max check attempts:

Capture 2.PNG

Has anyone ever seen something like this? If so, how did you fix it?

Posted: **Tue Jan 17, 2017 10:33 am**

Generally, I have seen this caused by multiple Nagios processes running at the same time.

If you run ps -ef | grep nagios.cfg | grep -v grep you should (on a normal system) see two processes running.

nagios    1734     1  0  2016 ?        00:05:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    2271  1734  0  2016 ?        00:01:36 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Notice that the second one is spawned from the initial parent process. If you run it on your system, I suspect you'll see more than two results. The general way to fix this is to kill off all current Nagios processes and then start Nagios fresh.
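For anyone who wants to script that check, here is a minimal sketch. It assumes the `ps -ef` column layout shown above (UID, PID, PPID, ...), and the function name `count_nagios_roots` is made up for illustration. It counts Nagios daemons that were not spawned by another Nagios daemon; going by the description above, a healthy system should report 1.

```shell
# Count Nagios daemons whose parent is NOT another Nagios daemon.
# Reads `ps -ef` style lines (UID PID PPID ...) on stdin.
count_nagios_roots() {
    awk '/nagios\.cfg/ { pid[$2] = 1; ppid[$2] = $3 }
         END {
             n = 0
             for (p in ppid) if (!(ppid[p] in pid)) n++
             print n
         }'
}

# Demo with the two sample lines from this post: prints 1
count_nagios_roots <<'EOF'
nagios    1734     1  0  2016 ?        00:05:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    2271  1734  0  2016 ?        00:01:36 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
EOF

# On a live system:
# ps -ef | grep nagios.cfg | grep -v grep | count_nagios_roots
```

A result greater than 1 would suggest that more than one independent Nagios daemon is scheduling checks.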

Posted: **Tue Jan 17, 2017 10:47 am**

Just as you expected, across all three instances:

Capture 3.PNG

Capture 4.PNG

Capture 5.PNG

Would a reboot of the server fix this issue? If not, how do I kill these processes? And to start it fresh, is that just:

service nagios restart

Posted: **Tue Jan 17, 2017 1:28 pm**

Nice catch. You should be able to run pkill nagios and then service nagios start. This kills off all of the running Nagios processes and then starts just the initial service.

Once it's down to a single Nagios instance, things should function as expected.

Posted: **Tue Jan 17, 2017 2:20 pm**

That seems to have worked well on one server:

Capture 8.PNG

But not so well on the other two.
For instance, on this one, Nagios process 11785 will not go away. I can run 'pkill nagios' and then 'ps -ef | grep nagios.cfg | grep -v grep', and the process is still there:

Capture 6.PNG

And on this server it's not taking care of any of the other processes:

Capture 7.PNG

Any ideas on how to kill these?

Posted: **Tue Jan 17, 2017 2:53 pm**

Never mind, I figured it out. I had to run 'kill -9 pid' to kill each of those stubborn processes individually. All three servers seem to be back to normal. I will keep an eye on our monitoring over the next few days to make sure this fixed the problem.
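In case it helps someone else, the "ask nicely, then force" approach can be sketched roughly like this. The function name `stop_pids` and the timeout are my own choices for illustration, not anything that ships with Nagios, and it assumes you run it as a user who is allowed to signal those processes (root or nagios).

```shell
# Send SIGTERM to each PID, wait up to TIMEOUT seconds, then SIGKILL
# (the same signal as kill -9) whatever is still running.
# Usage: stop_pids TIMEOUT_SECONDS PID...
stop_pids() {
    timeout_s=$1
    shift
    for pid in "$@"; do
        kill -TERM "$pid" 2>/dev/null
    done
    waited=0
    while [ "$waited" -lt "$timeout_s" ]; do
        alive=0
        for pid in "$@"; do
            if kill -0 "$pid" 2>/dev/null; then alive=1; fi
        done
        # Everything exited on its own: nothing left to force.
        [ "$alive" -eq 0 ] && return 0
        sleep 1
        waited=$((waited + 1))
    done
    for pid in "$@"; do
        if kill -0 "$pid" 2>/dev/null; then
            kill -KILL "$pid" 2>/dev/null
        fi
    done
}

# Example with the stubborn PID mentioned earlier in this thread:
# stop_pids 5 11785
# service nagios start
```

SIGKILL gives the process no chance to clean up, so it is only used here for whatever ignored the SIGTERM.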

Thank you for all the help.

Posted: **Tue Jan 17, 2017 3:26 pm**

Do you want us to leave this open while you monitor, or do you want us to lock it up?

Posted: **Wed Jan 18, 2017 9:23 am**

Let's leave it open for now, just in case.

Posted: **Wed Jan 18, 2017 9:49 am**

Sounds good.

Changed Host monitoring int, checking old and new intervals
