Unexpected Notifications on Nagios server reboot

Posted: Mon Jul 02, 2018 10:10 pm
by steliopappas
Hi guys

I've noticed that when our Nagios server is rebooted, we are bombarded with dozens, if not hundreds, of notifications.

This is my fourth Nagios environment, but I've never experienced this before. The previous three were Nagios Core 3 & 4, and this environment is Nagios XI.

After the Nagios server was started, it began to send notifications for most, if not all, hosts. I didn't pay attention to how many, as I was frantically trying to stop them at that stage.

The notifications were:

-all host based, i.e. none related to services.
-all informing that the host was in a flapping state, i.e. no notifications for UP/DOWN/UNREACHABLE states.

The first thing I suspected was that maybe I had cached host checks configured. My logic is that Nagios might only have had a record of which devices were down before the reboot, even though those states had since been cleared by virtue of services that had been checked. I wonder whether it knew the services were up before the cache was cleared in the reboot? This is just a hunch, and I'm open to other possibilities.

1. Can you tell me where to look to see if "Cached Host Checks" are enabled?

2. Can you think of any other possible reason that we would receive these notifications?
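
On question 1: in Nagios Core, host check caching is controlled by the cached_host_check_horizon directive in the main configuration file, and flap detection and state retention have their own directives there too. Below is a minimal Python sketch that pulls those settings out of the text of a nagios.cfg. The directive names are standard Nagios Core options; the function name and the sample text are made up for illustration, and the sample values are not taken from the poster's system.

```python
# Directives worth checking; all are standard Nagios Core main-config options.
DIRECTIVES = (
    "cached_host_check_horizon",
    "enable_flap_detection",
    "low_host_flap_threshold",
    "high_host_flap_threshold",
    "retain_state_information",
    "use_retained_program_state",
)


def read_directives(cfg_text, wanted=DIRECTIVES):
    """Return {directive: value} for the wanted directives found in cfg_text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in wanted:
            found[key] = value.strip()
    return found


# Hypothetical sample text, for illustration only.
sample = """
# example only
cached_host_check_horizon=15
enable_flap_detection=1
retain_state_information=1
"""
print(read_directives(sample))
```

To run it against a real system, read the file contents first and pass them in. On a default Nagios XI install the file is usually /usr/local/nagios/etc/nagios.cfg, but that path is an assumption and should be confirmed locally.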

Thanks in advance
Stel

Re: Unexpected Notifications on Nagios server reboot

Posted: Tue Jul 03, 2018 10:01 am
by scottwilkerson
Do these hosts share an underlying template that sets them to some unexpected state? Can you share the nagios.cfg for us to take a look at?

Re: Unexpected Notifications on Nagios server reboot

Posted: Wed Jul 04, 2018 12:40 am
by steliopappas
I found that when Nagios was rebooted, there were:

-a number of devices that were deemed to be in a DOWN state.
-a number of devices that were deemed to be in an UP state.

Here is how they looked when nagios started at around 12:13pm:

bxxxxxxxxx01 started in a DOWN state

Date / Time          Host          Service             State        State Type  Attempt  Information
2018-06-29 12:30:00  bxxxxxxxxx01                      UP           SOFT        2 of 3   OK - x.x.x.x: rta 28.112ms, lost 0%
2018-06-29 12:29:10  bxxxxxxxxx01                      UNREACHABLE  SOFT        1 of 3   CRITICAL - x.x.x.x: rta nan, lost 100%
2018-06-29 12:13:23  bxxxxxxxxx01                      UP           SOFT        2 of 3   OK - x.x.x.x: rta 34.230ms, lost 0%
2018-06-29 12:13:05  bxxxxxxxxx01  Internet Bandwidth  OK           SOFT        2 of 5   OK - Current BW in: 0Mbps Out: 0Mbps
2018-06-29 12:13:03  bxxxxxxxxx01                      DOWN         SOFT        1 of 3   CRITICAL - x.x.x.x: Host unreachable @ z.z.z.z. rta nan, lost 100%

I yellowed out the service related log lines to avoid confusion.

Note
x.x.x.x is the IP of bxxxxxxxx01.
z.z.z.z is the IP of the Nagios server.

The things that stand out to me are:
(i) At 12:13:03 the check output says "Host unreachable" even though the state is listed as "DOWN".
(ii) Other hosts in our environment that are actually down don't refer to the Nagios server (z.z.z.z) in their log entries.
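
Since every notification was about flapping, it may help to see how quickly a few state changes like the ones logged for this host add up. The Python sketch below is a deliberately simplified, unweighted version of the idea: it counts state changes between consecutive check results and expresses them as a percentage of the possible changes. As documented, Nagios itself looks at the last 21 check results and weights recent changes more heavily, so this is not its actual algorithm; the function name is hypothetical.

```python
def percent_state_change(states):
    """Unweighted percentage of consecutive check results that changed state.

    Simplification: Nagios uses a fixed window of recent results and weights
    newer changes more heavily; this sketch does neither.
    """
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


# Host states from the log above, oldest first (service line excluded).
history = ["DOWN", "UP", "UNREACHABLE", "UP"]
print(percent_state_change(history))  # prints 100.0
```

The result would be compared against the high_host_flap_threshold setting in nagios.cfg. Because the real calculation spreads changes over a longer window, the true figure for this host would be lower than this toy number, but a burst of failed first checks at startup still pushes it upward.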

I've shown the initial state config below. I can see that no initial state is configured on the host or any of the templates it depends on.

bxxxxxxxxx01 (host that started in a DOWN state)
Host Management > Check Settings > Initial state: = Not configured.
Host Management > Manage Templates: = xiwizard_switch_host

xiwizard_switch_host (template)
Host Template Management > Check Settings > Initial state: = Not configured.
Host Template Management > Manage Templates > xiwizard_generic_host

xiwizard_generic_host
Host Template Management > Check Settings > Initial state: = Not configured.
Host Template Management > Manage Templates > None listed


=====================================================================================================================

axxxxxxxxxxxx012 started in an UP state

There were no log entries recording a host state change in the time before and after 12:13:03 for this server.

I've shown the initial state config below. I can see that no initial state is configured on the host or any of the templates it depends on.

axxxxxxxxxxxx012 (host that started in an UP state)
Host Management > Check Settings > Initial state: = Not configured.
Host Management > Manage Templates: = xiwizard_windowswmi_host

xiwizard_windowswmi_host (template)
Host Template Management > Check Settings > Initial state: = Not configured.
Host Template Management > Manage Templates > xiwizard_generic_host

xiwizard_generic_host
Host Template Management > Check Settings > Initial state: = Not configured.
Host Template Management > Manage Templates > None listed



It seems that both devices had no initial state configured. What is the default when none is set?

Do you have any other ideas why one would start as DOWN and another might start as UP?

I've attached the nagios.cfg file as requested.

Stel

Re: Unexpected Notifications on Nagios server reboot

Posted: Thu Jul 05, 2018 9:00 am
by scottwilkerson
One thing to point out from your assessment: if a host cannot be reached AND it has a parent that cannot be reached, it is marked unreachable and is in a DOWN state.
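
The parent rule can be sketched in a few lines of Python. This follows the reachability logic as the Nagios Core documentation describes it: a host whose check fails is reported as DOWN if it has no parents or at least one parent is UP, and as UNREACHABLE if every parent is itself DOWN or UNREACHABLE. The function name and arguments are hypothetical and this is an illustration of the rule, not Nagios code.

```python
def classify_host(check_ok, parent_states):
    """Classify a host as UP, DOWN, or UNREACHABLE.

    check_ok      -- True if the host check itself succeeded
    parent_states -- list of states ("UP", "DOWN", "UNREACHABLE") of the
                     host's parents; empty if the host has no parents
    """
    if check_ok:
        return "UP"
    # No parents, or at least one working path via a parent: the host is DOWN.
    if not parent_states or any(state == "UP" for state in parent_states):
        return "DOWN"
    # Every parent is DOWN or UNREACHABLE, so the host cannot be reached.
    return "UNREACHABLE"


print(classify_host(False, []))        # prints DOWN
print(classify_host(False, ["DOWN"]))  # prints UNREACHABLE
```

Note that the "Host unreachable" text in the Information column is check plugin output, which is separate from the state Nagios assigns using this rule.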

This setup seems pretty standard. The only other thing I can think of that could cause this isn't really something we could test: if for some reason there were multiple Nagios parent processes running at the time of the reboot, the state file could have contained erroneous information.

Re: Unexpected Notifications on Nagios server reboot

Posted: Fri Jul 06, 2018 12:51 am
by steliopappas
Hi Scott

I started going through logs to try and identify why so many hosts were in a down state at start up.

I'm beginning to suspect that Nagios started up before the rest of the server was ready to start checking, causing the first host checks to fail.

The problem is that I'm an old SysV Linux/Unix admin who was focusing on Nagios when systemd whizzed by a few years ago.

I'll need to read up on systemd and dig up the Nagios XI installation instructions to see if it is configured correctly for startup. The machine was already built when I started on this gig, so I'm not sure if it was built the way it should be.
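
One way to test the "started too early" theory on a systemd machine is to check whether the unit that starts Nagios is ordered after network-online.target. Below is a minimal Python sketch that scans the text of a unit file for that ordering. The function name and the sample unit text are hypothetical; whether this server starts Nagios from a native unit file or from a SysV init script wrapped by systemd is not known from the thread, so treat this as an illustration only.

```python
def orders_after(unit_text, target="network-online.target"):
    """Return True if the [Unit] section has an After= line naming target."""
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, value = line.partition("=")
        if section == "Unit" and sep and key.strip() == "After":
            # After= takes a space-separated list of unit names.
            if target in value.split():
                return True
    return False


# Hypothetical unit text, for illustration only.
sample_unit = """
[Unit]
Description=Example monitoring daemon
After=network.target syslog.target

[Service]
ExecStart=/bin/true
"""
print(orders_after(sample_unit))  # prints False
```

Ordering after network.target alone, as in the sample, does not wait for the network to be usable. After=network-online.target is normally paired with Wants=network-online.target, and it only delays startup if a wait-online service for the network manager in use is enabled.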

Stel

Re: Unexpected Notifications on Nagios server reboot

Posted: Fri Jul 06, 2018 9:07 am
by scottwilkerson
Let us know what you come up with

Re: Unexpected Notifications on Nagios server reboot

Posted: Mon Jul 09, 2018 11:53 pm
by steliopappas
Hi Scott

I made a small tweak but didn't have much luck. That said, I did find something interesting that might help in /var/log/messages, but I'm not keen to post it in a forum. Is there another way to get it to you?

I see there is a link to a Support Centre (https://support.nagios.com/tickets/) with a facility to open a "Ticket". Is the "Ticket" environment visible to the public, i.e. simply another link back to the forums, or is it an option for a more private support process?

Thanks in advance
Stel

Re: Unexpected Notifications on Nagios server reboot

Posted: Tue Jul 10, 2018 8:18 am
by scottwilkerson
The ticket environment is not exposed to the public and would be the preferred method.

Please open a ticket and you can also reference this thread to give the technician some back story on the issue.

Thanks