I need to set up custom alerts for CPU and memory usage on a server:
when at 50% - email a specific set of people.
when at 60% - email everyone on the team.
when at 70% - email everyone and SMS the on-call staff.
Below is my setup, but I think I've made it more complicated than it needs to be.
$ cat _WindowsAppProxy.cfg
#######################################################################
# AppProxy Special Notification Services
#######################################################################
define service{
        host_name                       !CWWAPP923, !TWWAPP630
        hostgroup_name                  ds-webproxy-servers-prod
        service_description             Monitor App Proxy CPU Usage 50percent
        check_command                   check_nt!CPULOAD!-l 5,50,50
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              7
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  AppProxyContacts
        notification_options            w,u,c,r
        notification_interval           15
        notification_period             24x7
        }
define service{
        host_name                       !CWWAPP923, !TWWAPP630
        hostgroup_name                  ds-webproxy-servers-prod
        service_description             Monitor App Proxy CPU Usage 60percent
        check_command                   check_nt!CPULOAD!-l 5,60,60
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              7
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  CSGDS, BOEscalation
        notification_options            w,u,c,r
        notification_interval           15
        notification_period             24x7
        }
define service{
        host_name                       !CWWAPP923, !TWWAPP630
        hostgroup_name                  ds-webproxy-servers-prod
        service_description             Monitor App Proxy CPU Usage 70percent
        check_command                   check_nt!CPULOAD!-l 5,70,70
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              7
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  CSGDS, BOsms
        notification_options            w,u,c,r
        notification_interval           15
        notification_period             24x7
        }
define service{
        host_name                       !CWWAPP923, !TWWAPP630
        hostgroup_name                  ds-webproxy-servers-prod
        service_description             Monitor App Proxy Memory Usage 50percent
        check_command                   check_nt!MEMUSE!-w 50
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              7
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  AppProxyContacts
        notification_options            w,u,c,r
        notification_interval           15
        notification_period             24x7
        }
define service{
        host_name                       !CWWAPP923, !TWWAPP630
        hostgroup_name                  ds-webproxy-servers-prod
        service_description             Monitor App Proxy Memory Usage 60percent
        check_command                   check_nt!MEMUSE!-w 60
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              7
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  CSGDS, BOEscalation
        notification_options            w,u,c,r
        notification_interval           15
        notification_period             24x7
        }
define service{
        host_name                       !CWWAPP923, !TWWAPP630
        hostgroup_name                  ds-webproxy-servers-prod
        service_description             Monitor App Proxy Memory Usage 70percent
        check_command                   check_nt!MEMUSE!-c 70
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              7
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  CSGDS, BOsms
        notification_options            w,u,c,r
        notification_interval           15
        notification_period             24x7
        }
Alert
Re: Alert
It looks like the different checks have been set up correctly; are they not working as you expected?
On a side note: with multiple services at different thresholds, we usually have escalations built in, so you would not need three checks doing the same thing. Escalations let you move the issue up the chain. https://assets.nagios.com/downloads/nag ... tions.html
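For reference, a notification-count escalation attached to one of the services above might look like this sketch (the first_notification value and the choice of contact group are illustrative, not from the thread):

```
# Illustrative escalation: starting with the 3rd notification for this
# service, also notify the BOEscalation contact group
define serviceescalation{
        hostgroup_name          ds-webproxy-servers-prod
        service_description     Monitor App Proxy CPU Usage 50percent
        first_notification      3
        last_notification       0       ; 0 = keep escalating until recovery
        notification_interval   15
        contact_groups          BOEscalation
        }
```

Note this triggers on how many notifications have gone out, not on the measured percentage, which is the limitation discussed below.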
Former Nagios Employee
Re: Alert
The checks I set up are working; I was just hoping I could reduce the number of checks created. The escalation link is about notifications. I would like to escalate the service itself: if at 50, notify one group; if at 60, notify another. It seems like escalations are designed so that after a specific number of notifications, the issue escalates to another group. For us, if it's at 50% it will continue to notify the same set of people.
Re: Alert
It is designed to work with the number of notifications, which should coincide with how often your notifications go out.
There won't be a way to reduce it down further, without using escalations at this point.
Former Nagios Employee
Re: Alert
@rk is right. You can't escalate based on the result (50%, 70%, etc), only the number of notifications. So you have to create different service checks that will trigger at different thresholds, and notify different groups (to which you could also apply escalations if you wanted). So multiple checks are required. You could set the 50% one as the base, then "use" that one in the other two, only overriding the check_command and contacts settings to change the thresholds and whom to notify.
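That "use" suggestion might look like the following sketch, built from the CPU services in the original post (the template name appproxy-cpu-base is made up; only the overridden directives change per threshold):

```
# Template holding everything the three CPU services share
define service{
        name                    appproxy-cpu-base
        hostgroup_name          ds-webproxy-servers-prod
        check_period            24x7
        max_check_attempts      7
        normal_check_interval   5
        retry_check_interval    1
        notification_options    w,u,c,r
        notification_interval   15
        notification_period     24x7
        register                0       ; template only, never checked directly
        }

# Each real service overrides only the threshold and the contacts
define service{
        use                     appproxy-cpu-base
        service_description     Monitor App Proxy CPU Usage 50percent
        check_command           check_nt!CPULOAD!-l 5,50,50
        contact_groups          AppProxyContacts
        }
define service{
        use                     appproxy-cpu-base
        service_description     Monitor App Proxy CPU Usage 60percent
        check_command           check_nt!CPULOAD!-l 5,60,60
        contact_groups          CSGDS, BOEscalation
        }
```

The number of service objects stays the same, but the shared directives live in one place.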
On the other hand, you could use an event handler to do what you want with a single check. It would go something like this. Warning, what I'm about to discuss is an ADVANCED topic.
Turn off notifications for the service itself. Set the warning and critical thresholds to the maximum values you can tolerate (they will only be used on the GUI). Create an event handler that gets the following information passed to it:
You may want to refer to https://assets.nagios.com/downloads/nag ... dlers.html and https://assets.nagios.com/downloads/nag ... olist.html for more info.
define service{
        host_name               somehost
        service_description     MyService
        max_check_attempts      4
        event_handler           notify-MyService
        ...
        }
define command{
        command_name    notify-MyService
        command_line    /usr/local/nagios/libexec/eventhandlers/MyEventHandler $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $SERVICEPERFDATA$
        }
From here, EVERY time the service check runs, you get the host, the service name, the state (OK, WARNING, or CRITICAL), the state type (HARD or SOFT), how many times it's been in that state if it's a SOFT state, and - most importantly, the performance data. Each plugin (optionally) returns performance data that can be used to make graphs or do other things with. check_cpu and check_disk and so forth (the stock Nagios plugins) all do that. So in /usr/local/nagios/libexec/eventhandlers/MyEventHandler, you have access to all those things.
You could create a script that checks the performance data for specific values, and then emails different people based on the result. Granted, this is sending notifications outside of Nagios (which means no escalations and no contact management), but you could teach your script to use Nagios APIs so that it tells Nagios to do the notification, sends escalations, and uses the contacts listed in Nagios. This is extra advanced.
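A minimal sketch of such a handler script, taking the six arguments from the command definition above. The email addresses, the perfdata pattern, and the use of `mail` are all assumptions you would adapt to your environment:

```shell
#!/bin/bash
# Sketch of /usr/local/nagios/libexec/eventhandlers/MyEventHandler.
# Recipients and the perfdata parsing below are placeholders/assumptions.
HOST=$1; SERVICE=$2; STATE=$3; STATETYPE=$4; ATTEMPT=$5; PERFDATA=$6

# check_nt CPULOAD perfdata looks roughly like "'5 min avg Load'=54%;70;70;0;100";
# pull the integer after the '=' (adjust the pattern to your plugin's output)
LOAD=$(printf '%s' "$PERFDATA" | sed -n "s/.*=\([0-9][0-9]*\)%.*/\1/p")

# Map a measured percentage to a recipient list (addresses are placeholders)
recipients_for() {
    if   [ "$1" -ge 70 ]; then echo "team@example.com oncall-sms@example.com"
    elif [ "$1" -ge 60 ]; then echo "team@example.com"
    elif [ "$1" -ge 50 ]; then echo "proxy-admins@example.com"
    fi
}

# Only act on HARD non-OK states so transient blips don't generate mail
if [ "$STATETYPE" = "HARD" ] && [ "$STATE" != "OK" ] && [ -n "$LOAD" ]; then
    TO=$(recipients_for "$LOAD")
    if [ -n "$TO" ]; then
        printf '%s/%s is at %s%%\n' "$HOST" "$SERVICE" "$LOAD" \
            | mail -s "$SERVICE on $HOST: ${LOAD}%" $TO
    fi
fi
```

One handler then replaces the three per-threshold services, at the cost of managing recipients in the script rather than in Nagios contact groups (unless, as noted, you wire it back into the Nagios APIs).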
Last edited by eloyd on Thu Sep 15, 2016 10:57 am, edited 3 times in total.
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd
I'm a Nagios Fanatic! • Join our public Nagios Discord Server!