Services Immediately Moving To Hard Critical Status
Posted: Wed Jan 16, 2019 1:31 pm
Hello,
Please bear with me; my Nagios knowledge is limited. I would be happy to provide any additional information.
- 64-bit RHEL 7.6
- Manual installation of Nagios XI
- SSL is enabled for the web interface, and the databases run on a separate server (per the Nagios directions PDF)
- Currently running Nagios XI 5.5.5
I am focusing on one incident that occurred on Jan 13. The Nagios XI server lost connectivity to a group of servers, so both host and service checks were affected. I observed this behavior on hosts within that group; those servers are all configured similarly in Nagios.
I have attempted to configure host checks so that a host reaches a hard DOWN state before its service checks reach a hard state. Host checks run every 5 minutes, with a maximum of 3 check attempts and a retry interval of 1 minute. Service checks run every 5 minutes, with a maximum of 5 check attempts and a retry interval of 1 minute.
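For reference, the settings above correspond to object definitions roughly like the following (host and service names are placeholders, and other directives are omitted; this is a sketch of the relevant values, not our exact configuration):

```
# Sketch only: "example-host" and "Example Service" are placeholder names.
define host {
    host_name               example-host
    check_interval          5     ; normal check every 5 minutes
    retry_interval          1     ; recheck every 1 minute while in a SOFT state
    max_check_attempts      3     ; 3 consecutive failures => hard DOWN
}

define service {
    host_name               example-host
    service_description     Example Service
    check_interval          5     ; normal check every 5 minutes
    retry_interval          1     ; recheck every 1 minute while in a SOFT state
    max_check_attempts      5     ; expected: 5 consecutive failures => hard CRITICAL
}
```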
I would expect that once a service check fails (soft CRITICAL), it will retry every minute until it has run 5 times (4 more attempts). During that 5-minute window, I would expect a host check to run, fail, and mark the host as soft DOWN. I would then expect the service to continue its retries until it reaches the maximum number of attempts, and only then turn to a hard CRITICAL state.
What I have observed instead is that after the host is marked soft DOWN, the next service check immediately goes to a hard CRITICAL state, and notifications for those service checks are sent.
I have attached two images: one showing the state history for a particular host, and the other showing the notifications that were sent for that host.
On the state history image, I have drawn some colored dots:
- The blue dots are the host checks
- The red dots are when notifications were sent
- The green dots are services that immediately turned to hard critical status
- For two of the green dots, the service immediately went to hard critical status, but notifications were not sent.
- Am I misunderstanding how this works?
- Should I configure things differently?
- Why did those two services with the two green dots not send notifications, unlike the other services that went to hard critical?
I apologize for writing so much, and I greatly appreciate any insight into this situation. Again, I would be glad to provide any extra info.