Disabled Hosts Still Monitored

jbennett · Post by **jbennett** » Tue Aug 07, 2012 8:42 am

I have a handful of hosts that are still sending alerts even though they are marked as inactive.

I have tried the following:

Code: Select all

service nagios stop
killall -9 nagios
service nagios start

Code: Select all

service mysqld stop
/usr/local/nagiosxi/scripts/repairmysql.sh nagios
service mysqld start

I have even rebooted the server.

When I click for information about one of these hosts (the orange !), I see that it is still listed under a service; however, when I open that service, the host is not selected.

I have applied configuration and I have manually written configs from the tools menu within CCM.

I have also noticed that the host notifications do not include the description. I used the bulk import wizard to import these hosts, and upon doing so, the description field was not populated, even though it was mapped. I went back and added this field manually and applied the configuration, but it seems like the emails are still pointing to old host files?

When I check /usr/local/nagios/etc/hosts, I see these hosts listed. When opening the .cfg file for the host, I see that it is very limited in information. It's as if all of the changes that I've made, and which show up on the host in the CCM, are not being reflected in the actual host .cfg file.
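One way to see what Nagios is actually reading is to list what the on-disk host files contain and compare that against the CCM. The sketch below prints the host_name and alias found in each .cfg file. The function name show_host_aliases is made up for illustration, and it assumes one host definition per file with plain "directive value" lines, which is an assumption about how these files are laid out.

```shell
# Hypothetical helper: print "host_name<TAB>alias" for each host .cfg file.
# Assumes one "define host" block per file (not confirmed in this thread).
show_host_aliases() {
    cfg_dir="${1:-/usr/local/nagios/etc/hosts}"
    for f in "$cfg_dir"/*.cfg; do
        [ -e "$f" ] || continue          # directory empty or glob unmatched
        name=$(awk '$1 == "host_name" { print $2; exit }' "$f")
        # Blank the directive name, trim the leading gap, keep the rest as the alias.
        host_alias=$(awk '$1 == "alias" { $1 = ""; sub(/^[ \t]+/, ""); print; exit }' "$f")
        printf '%s\t%s\n' "$name" "${host_alias:-(no alias)}"
    done
}
```

Called with no argument, it reads the directory mentioned above. A host that shows "(no alias)" here but has a description in the CCM would suggest the file on disk was not rewritten by the last Apply Configuration.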

My next move is to completely delete these hosts and re-add them, but that creates the work all over again.

Any help would be greatly appreciated!!

yancy · Post by **yancy** » Tue Aug 07, 2012 10:30 am

jbennett,

What process did you use to mark the hosts as inactive?

Regards,

-Yancy

jbennett · Post by **jbennett** » Tue Aug 07, 2012 11:10 am

In each host, I unchecked the 'active' check box.

I have managed to get the alias to update.

I went through and deleted the .cfg files for each of these hosts, then applied the config again. The .cfg files were rewritten, as expected. Then I was able to go in and mark them active (check box), then apply the config again.

At this point, the alias names were populated finally.

I have then gone back through and removed these hosts from their associated services, applied config once more.

Now I am going back through, marking them inactive once again, and applying the config once more to see if it works.

EDIT: This process seems to have worked. I'm not sure why it didn't work initially.
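For anyone repeating the "delete the .cfg files, then apply the config" step, moving the files aside is a safer variant than deleting them. This is only a sketch: stash_host_cfgs is a hypothetical name, and it assumes each host file is named after its host_name with a .cfg extension.

```shell
# Hypothetical helper: move selected host .cfg files into a backup directory
# so that the next Apply Configuration rewrites them.
# Assumes each file is named <host_name>.cfg (an assumption).
stash_host_cfgs() {
    cfg_dir="$1"
    backup_dir="$2"
    shift 2
    mkdir -p "$backup_dir" || return 1
    for host in "$@"; do
        if [ -f "$cfg_dir/$host.cfg" ]; then
            mv "$cfg_dir/$host.cfg" "$backup_dir/" && echo "moved $host.cfg"
        else
            echo "skipped $host.cfg (not found)"
        fi
    done
}

# Example (host names are placeholders):
#   stash_host_cfgs /usr/local/nagios/etc/hosts /root/hosts-backup host1 host2
```

After running it, apply the configuration from the web interface as described above and check that the files reappear.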

Now, I need to figure out why my RTA on the ping is taking minutes from the Nagios box, but only on these specific hosts.

yancy · Post by **yancy** » Tue Aug 07, 2012 3:15 pm

jbennett,

jbennett wrote:
Now, I need to figure out why my RTA on the ping is taking minutes from the Nagios box, but only on these specific hosts.

Are you referring to the RTA using the ping command from the command line, or check_ping? Are they different RTA values?

Regards,

-Yancy

jbennett · Post by **jbennett** » Tue Aug 07, 2012 3:47 pm

The RTA from 'check_host_alive' (check_icmp) is showing up as over 3 minutes. However, if I ping from my box, it is just milliseconds. If I ping from the Nagios box command line, it is taking a while as well.

Again, all other hosts are fine, it is only these new hosts (all the same piece of equipment) that are returning a long RTA (and thus showing as critical in Nagios).

yancy · Post by **yancy** » Tue Aug 07, 2012 4:01 pm

jbennett,

The check_host_alive doesn't look to be the same as check_icmp.

http://assets.nagios.com/downloads/nagi ... _In_XI.pdf

It's used if you're trying to check the status of a machine that has a firewall rule enabled that won't allow ICMP traffic. Maybe switch to check_ping instead, if that works.

Regards,

-Yancy

jbennett · Post by **jbennett** » Wed Aug 08, 2012 8:55 am

OK, when I run the checks from the Nagios box command line, these are the results:

Code: Select all

[root@nagiosxivm libexec]# ./check_ping -H xxx.xxx.xxx.xxx -w 3000.0,80% -c 5000.0,100% -p 5
PING CRITICAL - Packet loss = 0%, RTA = 1796353.12 ms|rta=1796353.125000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0

Code: Select all

[root@nagiosxivm libexec]# ./check_icmp -H xxx.xxx.xxx.xxx
CRITICAL - 10.100.158.121: rta 2065656.792ms, lost 0%|rta=2065656.792ms;200.000;500.000;0; pl=0%;40;80;;

I'm not sure I understand, because when I check other pieces of equipment at the same location, I get the following results:

Code: Select all

[root@nagiosxivm libexec]# ./check_ping -H xxx.xxx.xxx.xxx -w 3000.0,80% -c 5000.0,100% -p 5
PING OK - Packet loss = 0%, RTA = 11.61 ms|rta=11.613000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0

Code: Select all

[root@nagiosxivm libexec]# ./check_icmp -H xxx.xxx.xxx.xxx
OK - 10.100.158.131: rta 10.310ms, lost 0%|rta=10.310ms;200.000;500.000;0; pl=0%;40;80;;

I'm not sure I understand the discrepancy.
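To put the failing numbers in perspective, the rta value can be pulled out of the performance data (the part after the "|") and converted to minutes. The helper names rta_ms and ms_to_min below are hypothetical, and the sketch only assumes the "rta=<value>ms" format visible in the output above.

```shell
# Hypothetical helpers for reading plugin output.
# rta_ms: extract the rta value (in ms) from the perfdata after the "|".
rta_ms() {
    sed -n 's/.*|rta=\([0-9.]*\)ms.*/\1/p'
}

# ms_to_min: convert milliseconds on stdin to minutes, one decimal place.
ms_to_min() {
    awk '{ printf "%.1f\n", $1 / 60000 }'
}

# Example using the failing check_ping line from above:
echo 'PING CRITICAL - Packet loss = 0%, RTA = 1796353.12 ms|rta=1796353.125000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0' \
    | rta_ms | ms_to_min
# prints 29.9
```

The failing check_icmp line works out to about 34.4 minutes the same way. Round-trip times of roughly half an hour with 0% packet loss may point to a measurement artifact rather than real latency, though nothing in the output above confirms that.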

yancy · Post by **yancy** » Wed Aug 08, 2012 9:43 am

jbennett,

Just to verify, if you use ping from the command line the results are as expected?

Code: Select all

ping xxx.xxx.xxx.xxx

I just want to rule out any network connectivity or congestion issues.

Regards,

-Yancy

jbennett · Post by **jbennett** » Wed Aug 08, 2012 12:00 pm

Code: Select all

[root@nagiosxivm ~]# ping xxx.xxx.xxx.xxx
PING xxx.xxx.xxx.xxx (xxx.xxx.xxx.xxx) 56(84) bytes of data.
40 bytes from xxx.xxx.xxx.xxx: icmp_seq=1 ttl=123 (truncated)
40 bytes from xxx.xxx.xxx.xxx: icmp_seq=2 ttl=123 (truncated)
40 bytes from xxx.xxx.xxx.xxx: icmp_seq=3 ttl=123 (truncated)
40 bytes from xxx.xxx.xxx.xxx: icmp_seq=4 ttl=123 (truncated)
^C
--- xxx.xxx.xxx.xxx ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3910ms
rtt min/avg/max/mdev = 0.000/0.000/0.000/0.000 ms
[root@nagiosxivm ~]# ping xxx.xxx.xxx.xxx
PING xxx.xxx.xxx.xxx (xxx.xxx.xxx.xxx) 56(84) bytes of data.
64 bytes from xxx.xxx.xxx.xxx: icmp_seq=1 ttl=59 time=8.50 ms
64 bytes from xxx.xxx.xxx.xxx: icmp_seq=2 ttl=59 time=8.12 ms
64 bytes from xxx.xxx.xxx.xxx: icmp_seq=3 ttl=59 time=8.08 ms
64 bytes from xxx.xxx.xxx.xxx: icmp_seq=4 ttl=59 time=8.12 ms
^C
--- xxx.xxx.xxx.xxx ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 7781ms
rtt min/avg/max/mdev = 7.916/8.178/8.513/0.230 ms

The first one is the device that's having trouble, the second one is another device at the same location that is NOT having trouble. If I ping from my desktop (Windows), I get normal times for both devices. I get the exact same result from the box on which the Nagios VM resides as well (Windows).

Code: Select all

Pinging xxx.xxx.xxx.xxx with 32 bytes of data:
Reply from xxx.xxx.xxx.xxx: bytes=32 time=7ms TTL=124
Reply from xxx.xxx.xxx.xxx: bytes=32 time=6ms TTL=124
Reply from xxx.xxx.xxx.xxx: bytes=32 time=6ms TTL=124
Reply from xxx.xxx.xxx.xxx: bytes=32 time=7ms TTL=124
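One difference stands out in the captures above: the troubled device's replies to the Linux ping are 40 bytes and flagged "(truncated)", with no time= field, while the other device returns the full 64 bytes. The Windows ping sends only 32 bytes of data and gets normal replies. A small sketch for spotting this in Linux ping output follows. count_truncated is a hypothetical name, and whether truncated replies are what throws off check_ping and check_icmp is a guess, not something confirmed in this thread.

```shell
# Hypothetical helper: count ping replies flagged "(truncated)" on stdin.
count_truncated() {
    grep -c '(truncated)'
}

# Example (the address is a placeholder):
#   ping -c 4 xxx.xxx.xxx.xxx | count_truncated
#   ping -c 4 -s 32 xxx.xxx.xxx.xxx    # 32 data bytes, like the Windows ping above
```

If the second command shows normal time= values for the troubled device, the payload size would be worth looking at before changing any thresholds.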

I'm wondering if it would make sense to remove the time constraint and just make sure we get pings, period.

yancy · Post by **yancy** » Wed Aug 08, 2012 1:13 pm

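check_ping requires warning and critical levels; check_icmp falls back to built-in defaults when they are omitted (your check_icmp output above shows 200 ms / 40% and 500 ms / 80%).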

Can you check the file permissions on both of those files and see if they match the sample below:

[root@localhost libexec]# ll check_icmp
-r-sr-xr-x 1 root root 107822 May 30 07:14 check_icmp
[root@localhost libexec]# ll check_ping
-rwxr-xr-x 1 root root 123568 May 30 07:14 check_ping
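The detail to look for in the sample is the "s" in the check_icmp permissions: the plugin opens a raw socket, which is why it is normally installed setuid root. A minimal sketch of the same check in script form, with has_setuid as a hypothetical helper name:

```shell
# Hypothetical helper: report whether a plugin file has the setuid bit set.
has_setuid() {
    if [ -u "$1" ]; then
        echo "$1: setuid bit set"
    else
        echo "$1: setuid bit NOT set"
    fi
}

# Example (path assumes the plugins live in /usr/local/nagios/libexec):
#   has_setuid /usr/local/nagios/libexec/check_icmp
```

If the bit is missing, running chown root:root and chmod u+s on check_icmp as root would bring it in line with the sample.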

regards,

-Yancy
