Changed Host monitoring int, checking old and new intervals

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
peter.zanetti
Posts: 90
Joined: Wed Oct 01, 2014 8:34 am

Post by peter.zanetti »

We monitor roughly 13k devices across three Nagios servers. We recently changed how these hosts are monitored. They used to be checked every 15 minutes and would alert after 5 checks (i.e. after about 1 hour down). We changed them to be checked every hour and to alert after 25 checks (i.e. after about 1 day down). I made the change by running a command-line script that updated the config file for each host.
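The alert delays quoted above follow from simple arithmetic. A minimal sketch, assuming that re-checks after the first failure run at the same interval as the regular check (in Nagios terms, that the retry interval equals the check interval) and that the first failed check counts as attempt 1; the function name is hypothetical:

```shell
#!/bin/sh
# minutes_until_alert INTERVAL_MINUTES MAX_ATTEMPTS
# Assumption: re-checks run every INTERVAL_MINUTES after the first failed check,
# so the alert fires (MAX_ATTEMPTS - 1) intervals after the host first goes down.
minutes_until_alert() {
  echo $(( ($2 - 1) * $1 ))
}

echo "old: $(minutes_until_alert 15 5) minutes"    # every 15 min, alert on attempt 5
echo "new: $(minutes_until_alert 60 25) minutes"   # every 60 min, alert on attempt 25
```

This prints "old: 60 minutes" and "new: 1440 minutes" (24 hours), matching the 1 hour and 1 day figures. If the retry interval differs from the check interval, the retry interval is the one to plug in.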

As you can see below, one host has an "X of 25" attempt counter that advances every hour, as intended, but it also has an "X of 5" counter that advances every 15 minutes. When that second counter reaches 5 of 5, an alert is generated, which should not happen:
Capture.PNG
You can also see that the host shows as configured with 25 max check attempts:
Capture 2.PNG
Has anyone seen something like this before, and if so, how did you fix it?
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Changed Host monitoring int, checking old and new interv

Post by rkennedy »

Generally, I have seen this when multiple Nagios processes are running at the same time.

If you run 'ps -ef | grep nagios.cfg | grep -v grep', a normal system should show two processes running:

Code:

nagios    1734     1  0  2016 ?        00:05:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    2271  1734  0  2016 ?        00:01:36 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Notice that the second one was spawned by the first: its parent PID (1734) is the PID of the first process. If you run the command on your system, I suspect you'll see more results than that. The usual fix is to kill all current Nagios processes and then start Nagios fresh.
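Reading the parent PID column by eye works, but the same check can be scripted. A minimal sketch, with a hypothetical function name: it counts nagios.cfg processes whose parent is not itself a nagios.cfg process, i.e. the roots of independently running Nagios instances. With a process tree like the example above it prints 1. The live usage line assumes a procps-style 'ps' that supports '-eo pid=,ppid=,args=' (as on typical Linux systems).

```shell
#!/bin/sh
# count_nagios_daemons: reads "PID PPID COMMAND..." lines on stdin and prints how many
# nagios.cfg processes have a parent that is NOT also a nagios.cfg process.
# Each such process is the root of a separate, independently running instance.
count_nagios_daemons() {
  # Note: the awk program text below deliberately never contains the literal
  # config file name, so awk's own command line is not counted.
  awk '
    /nagios\.cfg/ { parent[$1] = $2 }      # remember PID -> PPID for matching lines
    END {
      n = 0
      for (p in parent)
        if (!(parent[p] in parent)) n++    # parent is not one of the matched processes
      print n
    }'
}

# Live usage (assumes procps ps):
#   ps -eo pid=,ppid=,args= | count_nagios_daemons
```

A result above 1 would point to duplicate instances like the ones found in this thread.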
Former Nagios Employee
peter.zanetti

Re: Changed Host monitoring int, checking old and new interv

Post by peter.zanetti »

Just as you expected, across all three instances:
Capture 3.PNG
Capture 4.PNG
Capture 5.PNG
Would rebooting the server fix this issue? If not, how do I kill these processes? And to start it fresh, is that just a

Code:

service nagios restart
rkennedy

Re: Changed Host monitoring int, checking old and new interv

Post by rkennedy »

Nice catch. You should be able to run 'pkill nagios' and then 'service nagios start'. This kills all of the Nagios processes and then starts a single instance of the service.

Once only one instance is running, things should function as expected.
peter.zanetti

Re: Changed Host monitoring int, checking old and new interv

Post by peter.zanetti »

That seems to have worked well on one server:
Capture 8.PNG
But not so well on the other two.
For instance, on this one, Nagios process 11785 will not go away. I can run 'pkill nagios' and then 'ps -ef | grep nagios.cfg | grep -v grep', and the process is still there:
Capture 6.PNG
And on this server, it's not killing any of the extra processes:
Capture 7.PNG
Any ideas on how to kill these?
peter.zanetti

Re: Changed Host monitoring int, checking old and new interv

Post by peter.zanetti »

Never mind, I figured it out. I had to run 'kill -9 <pid>' on each of those stubborn processes individually. All three servers seem to be back to normal. I will keep an eye on our monitoring over the next few days to make sure this fixed the problem.
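The approach that worked here, a normal kill first and then 'kill -9' for processes that refuse to exit, can be sketched in POSIX sh. This is a minimal sketch, not an official Nagios procedure; the function name and the grace period are hypothetical choices, and the PIDs are assumed to come from the 'ps -ef | grep nagios.cfg | grep -v grep' output.

```shell
#!/bin/sh
# terminate_with_escalation GRACE_SECONDS PID...
# Sends SIGTERM to each PID, polls for up to GRACE_SECONDS for them to exit,
# then sends SIGKILL (same as kill -9) to any that are still present.
terminate_with_escalation() {
  grace=$1
  shift
  for pid in "$@"; do
    kill -s TERM "$pid" 2>/dev/null || true   # ignore PIDs that are already gone
  done
  waited=0
  while [ "$waited" -lt "$grace" ]; do
    remaining=0
    for pid in "$@"; do
      if kill -0 "$pid" 2>/dev/null; then remaining=1; fi   # signal 0 = existence check
    done
    if [ "$remaining" -eq 0 ]; then return 0; fi
    sleep 1
    waited=$((waited + 1))
  done
  for pid in "$@"; do
    if kill -0 "$pid" 2>/dev/null; then
      echo "PID $pid still present after ${grace}s; sending SIGKILL" >&2
      kill -s KILL "$pid" 2>/dev/null || true
    fi
  done
  return 0
}
```

For example, 'terminate_with_escalation 10 11785' would have handled the process that survived 'pkill nagios', after which 'service nagios start' brings up a single instance. Trying SIGTERM first gives the daemon a chance to shut down cleanly; SIGKILL cannot be caught, so it is kept as the last resort.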

Thank you for all the help.
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: Changed Host monitoring int, checking old and new interv

Post by dwhitfield »

Do you want us to leave this open while you monitor, or should we lock it?
peter.zanetti

Re: Changed Host monitoring int, checking old and new interv

Post by peter.zanetti »

Let's leave it open for now, just in case.
dwhitfield

Re: Changed Host monitoring int, checking old and new interv

Post by dwhitfield »

Sounds good.