Creating an alert for host stuck in boot loop
Hello,
I received a request from one of my users today. Do you know of a way to monitor or alert on a host that is stuck in a boot loop? Last week one of our Exchange WAPs was stuck in a boot loop and was rebooting so quickly that it was beating the check interval of XI, so XI never reported an issue with the host.
I could change the check interval to be more aggressive, but I'm wondering if there is another solution you may know of? I've looked in the forums and in the Nagios Exchange and I can't find anything useful.
Thank you!
Re: Creating an alert for host stuck in boot loop
Just so I know I have this correct, the system would boot up far enough for Nagios to get an OK check from the server, and then crash/burn/reboot, back up in time to present the next OK check?
If that's the case, there might be something fancy we could do if the server has IPMI. Or there's probably an uptime check for Windows. If it's less than say 5 minutes, throw an alert. Would that work?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Creating an alert for host stuck in boot loop
Ok, so I was looking at the Uptime check and I'm curious how it alerts?
There don't seem to be any variables within the configuration of the service check, so I'm not sure how it determines OK vs Warning vs Critical.
Re: Creating an alert for host stuck in boot loop
Oh, the check_uptime plugin that comes with Nagios is for Linux machines. I was thinking we could probably find one for Windows. I will search around. Maybe this would be a pretty easy check to make.
Re: Creating an alert for host stuck in boot loop
Oh ok, the one I used was built into the Configuration Wizard for Windows Servers.
Re: Creating an alert for host stuck in boot loop
Oh! Of course, my apologies. I believe that's using the check_nt command. Are you running NSClient on the Windows machine?
Re: Creating an alert for host stuck in boot loop
Yes, we're running NSClient version 0.5.2.35
Re: Creating an alert for host stuck in boot loop
Okay, I've got it here. If you specify the unit of measure, your warning/critical thresholds are based on that unit. So you're looking for something like this:
Code:
/usr/local/nagios/libexec/check_nt --host <yourhostIP> -p <portnum> -s <secrettoken> -v UPTIME -l minutes -c 5
Example:
Code:
root@weylandxi:/usr/local/nagios/libexec# ./check_nt --host 192.168.145.90 -p 12489 -s ASecretToken -v UPTIME -l minutes -w 14120
System Uptime - 9 day(s) 19 hour(s) 17 minute(s) |uptime=14117
root@weylandxi:/usr/local/nagios/libexec# echo $?
1
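In the example run, an uptime of 14117 minutes against -w 14120 produced exit code 1 (WARNING), which suggests the check raises a state when uptime is at or below the threshold, in the unit given with -l. The sketch below only mirrors that comparison so the thresholds are easier to reason about. It does not call check_nt, the function name uptime_exit_code is hypothetical, and the "at or below" rule is an assumption inferred from that output.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of check_nt: prints the Nagios exit code
# (0 = OK, 1 = WARNING, 2 = CRITICAL) for an uptime value and optional
# warning/critical thresholds, all in the same unit as -l.
# Assumption: a state is raised when uptime is at or below the threshold.
uptime_exit_code() {
    local uptime=$1 warn=${2:-} crit=${3:-}
    if [[ -n $crit ]] && (( uptime <= crit )); then
        echo 2
    elif [[ -n $warn ]] && (( uptime <= warn )); then
        echo 1
    else
        echo 0
    fi
}

uptime_exit_code 14117 14120 ""   # the example run: prints 1
uptime_exit_code 3 "" 5           # host up 3 minutes with -c 5: prints 2
uptime_exit_code 45 "" 5          # host up 45 minutes with -c 5: prints 0
```

Pass an empty string for a threshold you are not using, the same way the example omits -c.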
Re: Creating an alert for host stuck in boot loop
Ok, this worked how I imagined it would, which is great.
Now I'm curious if there is a way we could massage the check intervals, so that a normal reboot doesn't trigger an alert? In my test scenario, I added the service check to a test host and then rebooted it. It triggered a critical alert as soon as Nagios saw the uptime was 5 minutes or less.
We're using a check interval of 5, retry interval of 1 and max check attempts of 5. Would it be a good idea to increase the max check attempts to something between 7 and 10?
Thank you!
Re: Creating an alert for host stuck in boot loop
That's where this solution gets a little tricky. If you increase the max check attempts, the service goes critical but never notifies, because it always goes back to OK before the retries run out.
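To put rough numbers on that, here is a small sketch of the retry arithmetic. It assumes intervals are in minutes (the default interval_length of 60 seconds), that checks run exactly on schedule, and that the uptime check is critical at or below the threshold. Both function names are hypothetical helpers, not Nagios commands.

```shell
#!/usr/bin/env bash
# Minutes between the first non-OK result and the HARD state, assuming
# every retry in between is also non-OK.
minutes_to_hard_state() {
    local max_attempts=$1 retry_interval=$2
    echo $(( (max_attempts - 1) * retry_interval ))
}

# Succeeds if a host that rebooted only once could still reach a HARD
# CRITICAL, i.e. the whole retry window fits at or below the uptime
# threshold (threshold in minutes, as with -l minutes -c <threshold>).
single_reboot_can_alert() {
    local threshold=$1 max_attempts=$2 retry_interval=$3
    local window
    window=$(minutes_to_hard_state "$max_attempts" "$retry_interval")
    (( window <= threshold ))
}

minutes_to_hard_state 5 1    # settings from this thread: prints 4

if single_reboot_can_alert 5 5 1; then
    echo "max attempts 5: a single reboot can still notify"
fi
if ! single_reboot_can_alert 5 7 1; then
    echo "max attempts 7: uptime passes 5 minutes before the retries run out"
fi
```

Under these assumptions, once the retry window is longer than the uptime threshold, a single reboot always recovers to OK before a notification is sent.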
The best thing I can think of right now is that you can use scheduled downtime for these servers. You put them in downtime at a certain time of day when you normally apply patches, and then everything else is either a valid alert, or you send a followup email stating why the server rebooted (emergency patches, or what-have-you).
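As a rough illustration of the scheduled downtime idea, the sketch below builds a SCHEDULE_SVC_DOWNTIME external command for a fixed window. The host name, service description, author, and comment are placeholders, the function name downtime_command is hypothetical, and the command file path in the last comment is the usual default that may differ on your install.

```shell
#!/usr/bin/env bash
# Prints a SCHEDULE_SVC_DOWNTIME external command line for a fixed window.
# Arguments: host, service description, start and end (epoch seconds),
# author, comment, and an optional timestamp for the command itself.
downtime_command() {
    local host=$1 service=$2 start=$3 end=$4 author=$5 comment=$6
    local now=${7:-$(date +%s)}
    # fixed=1 and trigger_id=0; the duration field is still required for
    # fixed downtime, so pass the length of the window in seconds.
    printf '[%s] SCHEDULE_SVC_DOWNTIME;%s;%s;%s;%s;1;0;%s;%s;%s\n' \
        "$now" "$host" "$service" "$start" "$end" \
        "$(( end - start ))" "$author" "$comment"
}

# Example: a two-hour patch window starting now (placeholder names).
start=$(date +%s)
end=$(( start + 7200 ))
downtime_command "exch-wap01" "Uptime" "$start" "$end" "nagiosadmin" "Patch window"

# To apply it, append the output to the Nagios command file, for example:
#   downtime_command ... >> /usr/local/nagios/var/rw/nagios.cmd
```

This schedules downtime for the uptime service only, so other checks on the host keep alerting during the patch window.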