Service Status When Host Is Down

toodaly · Post by **toodaly** » Fri Jan 29, 2016 12:27 pm

What is the Nagios XI default expected response to a host being disconnected from the network?

What I expected to see in the Nagios XI GUI was the host status as DOWN and the host's associated service statuses as UNKNOWN.

What I actually observed in the Nagios XI GUI was that the host status transitioned to DOWN, but the services stayed at OK. The Nagios log shows the services transitioning to CRITICAL due to "No route to host."

I thought I read somewhere that when a host goes down, or a parent goes down in a parent/child relationship, the status of the associated services or child hosts is not reported, to prevent a flood of notifications. That would explain what I observed: the host is down but its services still show OK.

If this is the default expected behavior, is there a setting in the service definition that makes a service transition to UNKNOWN when its host is down, instead of leaving it as OK?

I am using Nagios XI version 2012R2.9. I know it's an older version, but it's the one that our requirements were verified with a couple years ago.

Thanks

hsmith · Post by **hsmith** » Fri Jan 29, 2016 1:50 pm

Are the checks passive checks? That's not normal behavior.

toodaly · Post by **toodaly** » Fri Jan 29, 2016 2:00 pm

These were active service checks.

What is the normal behavior that I should expect?

Thanks

hsmith · Post by **hsmith** » Fri Jan 29, 2016 2:04 pm

When a service check is unable to communicate with the host, it will generally go to a CRITICAL status, though this can vary depending on the check. Could you shed some light on what kind of checks these are?

toodaly · Post by **toodaly** » Fri Jan 29, 2016 2:33 pm

These were out-of-the-box checks using NSClient++ on a Windows 7 workstation: disk usage, CPU usage, and memory usage (check_nt, with the default thresholds of 80 warning and 90 critical for all checks).

Here's what was in /var/log/messages:
Nov 19 17:37:56 NAGIOSXI001 nagios: HOST ALERT: LAX01OWS001;DOWN;SOFT;1;CRITICAL - 168.192.1.1: rta nan, lost 100%
Nov 19 17:39:06 NAGIOSXI001 nagios: HOST ALERT: LAX01OWS001;DOWN;SOFT;2;CRITICAL - 168.192.1.1: Host unreachable @ 168.192.0.133. rta nan, lost 100%
Nov 19 17:39:47 NAGIOSXI001 nagios: SERVICE ALERT: LAX01OWS001;Drive C: Disk Usage;CRITICAL;HARD;5;No route to host
Nov 19 17:40:16 NAGIOSXI001 nagios: SERVICE ALERT: LAX01OWS001;Ping;CRITICAL;HARD;1;CRITICAL - 168.192.1.1: Host unreachable @ 168.192.0.133. rta nan, lost 100%
Nov 19 17:40:16 NAGIOSXI001 nagios: SERVICE ALERT: LAX01OWS001;CPU Usage;CRITICAL;HARD;1;No route to host
Nov 19 17:40:16 NAGIOSXI001 nagios: SERVICE ALERT: LAX01OWS001;Memory Usage;CRITICAL;HARD;1;No route to host
Nov 19 17:40:16 NAGIOSXI001 nagios: HOST ALERT: LAX01OWS001;DOWN;SOFT;3;CRITICAL - 168.192.1.1: Host unreachable @ 168.192.0.133. rta nan, lost 100%
Nov 19 17:41:27 NAGIOSXI001 nagios: HOST ALERT: LAX01OWS001;DOWN;SOFT;4;CRITICAL - 168.192.1.1: Host unreachable @ 168.192.0.133. rta nan, lost 100%
Nov 19 17:42:37 NAGIOSXI001 nagios: HOST ALERT: LAX01OWS001;DOWN;HARD;5;CRITICAL - 168.192.1.1: Host unreachable @ 168.192.0.133. rta nan, lost 100%
Nov 19 17:42:37 NAGIOSXI001 nagios: HOST NOTIFICATION: system-admin01;LAX01OWS001;DOWN;xi_host_notification_handler;CRITICAL - 168.192.1.1: Host unreachable @ 168.192.0.133. rta nan, lost 100%
Nov 19 17:42:37 NAGIOSXI001 nagios: HOST NOTIFICATION: system-admin02;LAX01OWS001;DOWN;xi_host_notification_handler;CRITICAL - 168.192.1.1: Host unreachable @ 168.192.0.133. rta nan, lost 100%
Nov 19 17:49:47 NAGIOSXI001 nagios: SERVICE ALERT: LAX01OWS001;Drive C: Disk Usage;WARNING;HARD;5;C:\ - total: 119.24 Gb - used: 96.24 Gb (81%) - free 23.00 Gb (19%)
Nov 19 17:50:07 NAGIOSXI001 nagios: SERVICE ALERT: LAX01OWS001;Memory Usage;OK;HARD;1;Memory usage: total:33714.35 Mb - used: 3599.17 Mb (11%) - free: 30115.18 Mb (89%)
Nov 19 17:50:07 NAGIOSXI001 nagios: SERVICE ALERT: LAX01OWS001;CPU Usage;OK;HARD;1;CPU Load 9% (5 min average)
Nov 19 17:50:07 NAGIOSXI001 nagios: SERVICE ALERT: LAX01OWS001;Ping;OK;HARD;1;OK - 168.192.1.1: rta 0.292ms, lost 0%
Nov 19 17:50:16 NAGIOSXI001 nagios: HOST ALERT: LAX01OWS001;UP;HARD;1;OK - 168.192.1.1: rta 0.274ms, lost 0%
Nov 19 17:50:16 NAGIOSXI001 nagios: HOST NOTIFICATION: system-admin01;LAX01OWS001;UP;xi_host_notification_handler;OK - 168.192.1.1: rta 0.274ms, lost 0%
Nov 19 17:50:16 NAGIOSXI001 nagios: HOST NOTIFICATION: system-admin02;LAX01OWS001;UP;xi_host_notification_handler;OK - 168.192.1.1: rta 0.274ms, lost 0%

The log says the services went to CRITICAL, but Home->Details->Service Detail in Nagios XI showed OK (green) for these services the entire time. Only Home->Details->Host Detail showed the host as Critical.
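To compare the log against what the GUI shows, the alert lines above can be reduced to the last HARD state per host or service. The following is only a rough sketch: it assumes the syslog-prefixed format seen in the excerpt (`HOST ALERT: host;state;type;attempt;output` and `SERVICE ALERT: host;service;state;type;attempt;output`), and the function names `parse_alert` and `last_hard_states` are made up for illustration.

```python
import re

# Matches the "nagios: HOST ALERT: ..." / "nagios: SERVICE ALERT: ..." part of a line.
# NOTIFICATION lines are deliberately not matched.
ALERT_RE = re.compile(r"nagios: (HOST|SERVICE) ALERT: (.*)$")


def parse_alert(line):
    """Return a dict describing an alert line, or None for any other line."""
    match = ALERT_RE.search(line.rstrip("\n"))
    if not match:
        return None
    kind, payload = match.groups()
    try:
        if kind == "HOST":
            host, state, state_type, attempt, output = payload.split(";", 4)
            service = None
        else:
            host, service, state, state_type, attempt, output = payload.split(";", 5)
        attempt = int(attempt)
    except ValueError:
        # Too few fields or a non-numeric attempt: treat as unparseable.
        return None
    return {
        "kind": kind,
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
        "attempt": attempt,
        "output": output,
    }


def last_hard_states(lines):
    """Map (host, service) to the most recent HARD state; service is None for hosts."""
    states = {}
    for line in lines:
        alert = parse_alert(line)
        if alert and alert["state_type"] == "HARD":
            states[(alert["host"], alert["service"])] = alert["state"]
    return states
```

Feeding it only the lines up to 17:42:37 would give the states Nagios Core had logged while the host was down, which is what the Service Detail page would be expected to reflect.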

Could there have been a breakdown between Nagios Core and Nagios XI?

hsmith · Post by **hsmith** » Fri Jan 29, 2016 2:42 pm

It's unlikely, but it's possible. Is your system a clean minimal installation of RHEL/CentOS? We've seen desktop environments mess with a lot of things in the past.

Some things you can check:

Code:

service nagios status
service mysqld status
tail /var/log/mysqld.log

toodaly · Post by **toodaly** » Fri Jan 29, 2016 3:04 pm

It's a clean install of RHEL 6.

I have a test plugin that I was going to use to monitor Nagios XI. It basically checks that the Nagios-related services (nagios, ndo2db, npcd, mysqld) are running and also checks partition utilization. I've had MySQL crash because the logs filled up the partition. Hopefully this will catch some of the anomalies we've seen.
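For anyone wanting something similar, here is a minimal sketch of that kind of plugin, not the actual one. It assumes Python 3 and `pgrep` are available on the Nagios XI server; the `/var` path, the 80/90 thresholds, and all function names are illustrative assumptions. The exit codes follow the standard plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import shutil
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Service names taken from the post; matching them by process name is an assumption.
SERVICES = ["nagios", "ndo2db", "npcd", "mysqld"]


def is_running(name):
    """True if a process with exactly this name exists (pgrep -x exits 0)."""
    result = subprocess.run(
        ["pgrep", "-x", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def used_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate(running, used_pct, warn=80.0, crit=90.0):
    """Combine the results into (exit_code, message); a stopped service wins."""
    stopped = sorted(name for name, up in running.items() if not up)
    if stopped:
        return CRITICAL, "CRITICAL - not running: " + ", ".join(stopped)
    if used_pct >= crit:
        return CRITICAL, "CRITICAL - partition %.0f%% used" % used_pct
    if used_pct >= warn:
        return WARNING, "WARNING - partition %.0f%% used" % used_pct
    return OK, "OK - all services running, partition %.0f%% used" % used_pct


def main(path="/var"):
    """Run all checks, print one status line, and return the exit code."""
    try:
        running = {name: is_running(name) for name in SERVICES}
        code, message = evaluate(running, used_percent(path))
    except OSError as exc:
        # e.g. pgrep missing or the path not existing
        code, message = UNKNOWN, "UNKNOWN - %s" % exc
    print(message)
    return code
```

To use it as a real plugin, the script would end with `sys.exit(main())` (after `import sys`) so that Nagios sees the return code as the check result.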

Let me know if there are any other Nagios services or items that a check running at a 5-minute interval should look at to monitor Nagios XI, or if there is a plugin that's already written.

Thanks.

hsmith · Post by **hsmith** » Fri Jan 29, 2016 3:10 pm

There are a lot of metrics on the 'system status' page under the 'admin' menu. Those are the most important services. crond is another huge one; not much is going to work if crond isn't running.
