Nagios Support Forum

Posted: **Wed Apr 23, 2014 5:03 am**

Hi All,

I would like to receive an email when a server reboots unexpectedly. I've read the thread http://support.nagios.com/forum/viewtop ... 16&t=24936 but didn't understand the solution that pacmag implemented, so I am hoping for some guidance there or other ideas.

I am mostly monitoring Windows servers and am monitoring the uptime variable on those servers. I was hoping it would be relatively easy to either:
a) receive a notification if the returned uptime is lower than a previous return value, or
b) receive a notification if the returned uptime is lower than x minutes.

Often a reboot (due to Windows updates or an administrator trying the perennial "a reboot will fix it") completes in far less time than the 5-minute check interval, and so happens without my knowledge.
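Option (a) amounts to remembering the last uptime sample and flagging any sample that is lower than it. A minimal sketch of that comparison, assuming the uptime values arrive as plain seconds; the function name and the sample values are hypothetical and not part of any Nagios plugin:

```python
from typing import Optional


def rebooted_since_last_check(previous_uptime: Optional[float],
                              current_uptime: float) -> bool:
    """Return True when uptime went backwards, which implies a reboot in between."""
    if previous_uptime is None:
        # First sample: nothing to compare against yet.
        return False
    return current_uptime < previous_uptime


# Hypothetical samples taken 5 minutes apart; the third one follows a reboot.
samples = [8300.0, 8600.0, 120.0, 420.0]

previous = None
events = []
for uptime in samples:
    events.append(rebooted_since_last_check(previous, uptime))
    previous = uptime

print(events)  # [False, False, True, False]
```

This approach needs somewhere to keep the previous value between checks, which is the main reason a fixed threshold as in option (b) tends to be simpler to set up.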

Cheers,
Chris.

Posted: **Wed Apr 23, 2014 9:52 am**

Do you have a preferred plugin or preferred way of monitoring (agent, WMI, SNMP) those systems? There definitely are ways of doing this, but if you are already monitoring in one way or another it's probably best to work with that existing setup rather than send you down a different path.

Posted: **Wed Apr 23, 2014 4:58 pm**

Hi Spenser,

I am using NSClient++ on these Windows servers, generally 0.4.1, and using check_nt as the command to retrieve the uptime data.

I only do this because this is what the server wizard sets up - if there's an easier or better way I'm happy to set it up differently.

Cheers,
Chris.

Posted: **Wed Apr 23, 2014 9:52 pm**

Hi Chris,
You can do what you're describing relatively easily. However, using the check_nt command for uptime is not going to work because it doesn't trigger warning or critical thresholds.

However, all is not lost. Instead you can use the check_nrpe command to query your Windows servers for the System Up Time performance counter (still using NSClient++) and then trigger alerts based on thresholds.

From your Nagios host command line:

Code:

check_nrpe -H <windows_server> -c CheckCounter -a "Counter=\System\System Up Time" ShowAll MinCrit=600

Which should respond with:

Code:

OK: \System\System Up Time: 8921.99|'System Up Time'=8921.994197;0;600;

The number it returns is in seconds, so my system has been up for about 148 minutes.
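To double-check the arithmetic, here is a small sketch that pulls the value out of the performance data (the part after the `|`) and converts it to minutes. The helper name is hypothetical, and the parsing assumes a single perfdata entry shaped like the output above:

```python
def parse_uptime_seconds(plugin_output: str) -> float:
    """Extract the first perfdata value, e.g. 'System Up Time'=8921.994197;0;600;"""
    # Everything after the pipe is performance data.
    _, _, perfdata = plugin_output.partition("|")
    # Split label from values, then take the first ';'-separated field.
    _label, _, values = perfdata.rpartition("=")
    return float(values.split(";")[0])


output = r"OK: \System\System Up Time: 8921.99|'System Up Time'=8921.994197;0;600;"

seconds = parse_uptime_seconds(output)
minutes = seconds / 60

print(f"{seconds:.2f} s = {minutes:.1f} min")  # 8921.99 s = 148.7 min
```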

The MinCrit value of 600 equals 10 minutes. So when this check executes and the uptime is less than 10 minutes, it will trigger a critical status. Once the uptime is past 10 minutes the service will return to an OK state.
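As a rough model of that behaviour, the sketch below maps an uptime to the standard Nagios exit codes (0 for OK, 2 for CRITICAL) and walks through three checks after a reboot. It assumes a strict "less than" comparison, as described above, and a 5-minute check interval; the function name and the sample uptimes are hypothetical, and this is not how NSClient++ itself is implemented:

```python
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def uptime_state(uptime_seconds: float, min_crit: float = 600) -> int:
    """CRITICAL while the uptime is below min_crit seconds, OK afterwards."""
    return CRITICAL if uptime_seconds < min_crit else OK


# Assumption: checks run every 300 s and the server came back up
# 290 s before the first check that follows the reboot.
check_interval = 300
uptimes_at_checks = [290 + check_interval * n for n in range(3)]  # 290, 590, 890

states = [uptime_state(u) for u in uptimes_at_checks]
print(states)  # [2, 2, 0]
```

Because the threshold (600 s) is larger than the check interval (300 s), at least one check lands inside the critical window after a reboot. Keep in mind that Nagios only notifies once a service reaches a hard state, so max_check_attempts and the retry interval also affect whether the email goes out.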

So when the service enters a critical state, whoever is a contact for that service will receive an alert (hence an email).

FYI, when you set up your services for uptime, the performance counter will require double backslashes ...

Code:

Counter=\\System\\System Up Time

Let me know how you go.

Troy

Posted: **Thu Apr 24, 2014 1:51 am**

Useful info Troy! Gonna try this too some day.

Posted: **Thu Apr 24, 2014 9:34 am**

Agreed, good post Troy! I'll have to make a note to add warning and critical values to that portion of check_nt.. unless we did that with plugins 2.0, I forget..

Posted: **Thu Apr 24, 2014 7:40 pm**

Cheers

I'm a big fan of performance counters ...

Posted: **Fri Apr 25, 2014 2:59 am**

Hi Troy,

Thanks for the response. I tried the command you posted above from the command line and initially received an error:

Code:

Request contained arguments (not currently allowed, check the allow arguments option).

I searched the forum and found some advice to add a line to the ini file

Code:

allow arguments=1

which fixed that problem.

Then I got an illegal metacharacter error, so added

Code:

allow_nasty_meta_chars=1

and all appears well. I shall add these services to this host and see how it goes over time.

Is there any down-side to allowing the arguments / nasty characters?

Thanks again,
Chris.

Posted: **Fri Apr 25, 2014 9:10 am**

Arguments on their own are fine, but allowing metacharacters can be a security risk. We block those characters to keep people from chaining commands and potentially compromising a system. For example, you could have check_nrpe call something like:

check_disk -w 20% -c 30% && cat /etc/passwd

which is why we disallow the ampersand.
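To illustrate the idea, here is a minimal sketch of the kind of filter that rejects such arguments. The character set is an assumption modeled on the characters these filters commonly reject, not the exact list used by NRPE or NSClient++, and the function name is hypothetical:

```python
# Illustrative set only (assumption): not the exact list any agent ships with.
NASTY_CHARS = set("|`&><'\"\\[]{};")


def contains_nasty_chars(argument: str) -> bool:
    """Return True if the argument holds a character the filter would reject."""
    return any(ch in NASTY_CHARS for ch in argument)


print(contains_nasty_chars("-w 20% -c 30%"))                      # False
print(contains_nasty_chars("-w 20% -c 30% && cat /etc/passwd"))   # True
print(contains_nasty_chars(r"Counter=\System\System Up Time"))    # True
```

If the backslash is on the rejected list, as it is in this sketch, that would be consistent with the counter path triggering the illegal metacharacter error until the setting was enabled.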

Posted: **Mon May 05, 2014 12:46 am**

Thanks for all the help everyone. This seems to be working quite well. I especially like that I can graph the uptime - the graphs don't look particularly nice but it keeps a history of the reboots which is a bonus.


Email on server reboot
