Creating an alert for host stuck in boot loop

Post by **rferebee** » Thu Feb 27, 2020 2:41 pm

Hello,

I received a request from one of my users today. Do you know of a way to monitor or alert on a host that is stuck in a boot loop? Last week, one of our Exchange WAPs got stuck in a boot loop and was rebooting so quickly that it beat XI's check interval, so XI never reported an issue with the host.

I could change the check interval to be more aggressive, but I'm wondering if there is another solution you may know of? I've looked in the forums and in the Nagios Exchange and I can't find anything useful.

Thank you!

Post by **mbellerue** » Thu Feb 27, 2020 4:52 pm

Just so I know I have this correct: the system would boot up far enough for Nagios to get an OK check from the server, then crash/burn/reboot, and come back up in time to present the next OK check?

If that's the case, there might be something fancy we could do if the server has IPMI. Or there's probably an uptime check for Windows. If it's less than say 5 minutes, throw an alert. Would that work?

Post by **rferebee** » Thu Feb 27, 2020 5:31 pm

Ok, so I was looking at the Uptime check and I'm curious how it alerts?

There don't seem to be any variables within the configuration of the service check, so I'm not sure how it determines OK vs Warning vs Critical.

Post by **mbellerue** » Thu Feb 27, 2020 5:46 pm

Oh, the check_uptime plugin that comes with Nagios is for Linux machines. I was thinking we could probably find one for Windows. I will search around. Maybe this would be a pretty easy check to make.

Post by **rferebee** » Thu Feb 27, 2020 5:50 pm

Oh ok, the one I used was built into the Configuration Wizard for Windows Servers.

Post by **mbellerue** » Fri Feb 28, 2020 10:48 am

Oh! Of course, my apologies. I believe that's using the check_nt command. Are you running NSClient on the Windows machine?

Post by **rferebee** » Fri Feb 28, 2020 11:15 am

Yes, we're running NSClient version 0.5.2.35

Post by **mbellerue** » Fri Feb 28, 2020 2:36 pm

Okay, I've got it here. If you specify the unit of measure, your warning/crit thresholds are based off of that. So you're looking for something like this:

Code:

/usr/local/nagios/libexec/check_nt --host <yourhostIP> -p <portnum> -s <secrettoken> -v UPTIME -l minutes -c 5

Example:

Code:

root@weylandxi:/usr/local/nagios/libexec# ./check_nt --host 192.168.145.90 -p 12489 -s ASecretToken -v UPTIME -l minutes -w 14120
System Uptime - 9 day(s) 19 hour(s) 17 minute(s) |uptime=14117
root@weylandxi:/usr/local/nagios/libexec# echo $?
1
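To make the numbers in that example easier to follow, here is a small sketch that redoes the arithmetic: 9 days, 19 hours and 17 minutes is 14117 minutes, which is below the `-w 14120` threshold, so the plugin exits 1 (WARNING). The function names are made up for illustration, and the "at or below the threshold" comparison is an assumption inferred from the output above, not something taken from the check_nt documentation.

```shell
# Convert days, hours and minutes of uptime into total minutes.
uptime_to_minutes() {
    echo $(( $1 * 1440 + $2 * 60 + $3 ))
}

# Map an uptime in minutes to a Nagios plugin exit code.
# Args: uptime, warning threshold, critical threshold.
# Pass an empty string to skip a threshold.
# ASSUMPTION: a threshold trips when uptime is at or below it.
uptime_exit_code() {
    if [ -n "$3" ] && [ "$1" -le "$3" ]; then
        echo 2   # CRITICAL
    elif [ -n "$2" ] && [ "$1" -le "$2" ]; then
        echo 1   # WARNING
    else
        echo 0   # OK
    fi
}

total=$(uptime_to_minutes 9 19 17)
echo "uptime=${total} exit=$(uptime_exit_code "$total" 14120 "")"
```

With the values from the example this prints `uptime=14117 exit=1`, matching the `echo $?` result above.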

Post by **rferebee** » Mon Mar 02, 2020 11:54 am

Ok, this worked how I imagined it would, which is great.

Now I'm curious if there is a way we could massage the check intervals, so that a normal reboot doesn't trigger an alert? In my test scenario, I added the service check to a test host and then rebooted it. It triggered a critical alert as soon as Nagios saw the uptime was 5 minutes or less.

We're using a check interval of 5, retry interval of 1 and max check attempts of 5. Would it be a good idea to increase the max check attempts to something between 7 and 10?

Thank you!

Post by **mbellerue** » Mon Mar 02, 2020 5:06 pm

That's where this solution gets a little tricky. If you raise max check attempts, the service goes critical but never notifies, because it always returns to OK before the retries run out.
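As a rough illustration of that timing, the sketch below models a single, normal reboot with the settings mentioned above (retry interval 1, max check attempts 5, critical at 5 minutes of uptime). It is a simplified model and not how the Nagios scheduler actually works: it assumes checks run exactly on schedule, the host stays up after the reboot, and a check is non-OK while uptime is at or below the threshold. The function name is hypothetical.

```shell
# Would a single reboot push the uptime service into a HARD (notifying) state?
# Args: uptime at first failing check, critical threshold,
#       retry interval, max check attempts (minutes / counts).
# ASSUMPTIONS: checks run exactly retry_interval minutes apart, the host
# does not reboot again, and a check fails while uptime <= threshold.
reaches_hard_state() {
    # Uptime seen by the final attempt (attempt number max_check_attempts).
    last_uptime=$(( $1 + ($4 - 1) * $3 ))
    if [ "$last_uptime" -le "$2" ]; then
        echo yes
    else
        echo no
    fi
}

reaches_hard_state 1 5 1 5   # first failing check at 1 minute of uptime
reaches_hard_state 2 5 1 5   # first failing check at 2 minutes of uptime
reaches_hard_state 0 5 1 7   # max check attempts raised to 7
```

Under these assumptions the three calls print `yes`, `no`, `no`: with 5 attempts a reboot only notifies if the first check catches it within its first minute, and with 7 attempts a single reboot recovers to OK before the retries are used up.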

The best thing I can think of right now is that you can use scheduled downtime for these servers. You put them in downtime at a certain time of day when you normally apply patches, and then everything else is either a valid alert, or you send a followup email stating why the server rebooted (emergency patches, or what-have-you).
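For the scheduled downtime part, one way to script it is Nagios Core's `SCHEDULE_SVC_DOWNTIME` external command. The sketch below only builds the command line; the host name, service description, author and timestamps are hypothetical placeholders, and the function name is made up for this example.

```shell
# Build a SCHEDULE_SVC_DOWNTIME external command line.
# Args: now, host, service, start, end, author, comment
# (now/start/end are Unix epoch seconds).
# Uses fixed=1 and trigger_id=0; duration is set to the window length.
build_svc_downtime() {
    printf '[%s] SCHEDULE_SVC_DOWNTIME;%s;%s;%s;%s;1;0;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$(( $5 - $4 ))" "$6" "$7"
}

# Hypothetical one-hour patch window for a service named "Uptime".
build_svc_downtime 1583100000 exchange-wap01 "Uptime" \
    1583103600 1583107200 nagiosadmin "Patch window"
```

To actually submit it, the line would be appended to the Nagios external command file. On a source install that is usually `/usr/local/nagios/var/rw/nagios.cmd`, but treat that path as an assumption and check your own configuration.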
