Notification e-mail not sending to all contactgroup members

vauran
Posts: 12
Joined: Wed Mar 18, 2015 4:10 pm

Re: Notification e-mail not sending to all contactgroup memb

Post by vauran »

So nagios.debug is already giving me some good data:

[1426798329.049619] [032.1] [pid=19507] Notifications are temporarily disabled for this service, so we won't send one out.
[1426798329.049627] [032.0] [pid=19507] Notification viability test failed. No notification will be sent out.
[1426798219.201630] [032.1] [pid=19507] We shouldn't re-notify contacts about this service problem.


These are the three messages that stand out in the debug file. Any advice on what they mean?
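For anyone reading along, here is a minimal sketch of how these debug lines could be picked apart. The layout (epoch seconds and microseconds, then debug level and verbosity, then the pid) is inferred from the three lines quoted in this post, and the names `DEBUG_LINE`, `DebugEntry` and `parse_debug_line` are made up for illustration, not part of Nagios.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Layout inferred from the quoted lines (an assumption, not an official spec):
# [<epoch secs>.<microsecs>] [<debug level>.<verbosity>] [pid=<pid>] <message>
DEBUG_LINE = re.compile(
    r"^\[(?P<secs>\d+)\.(?P<usecs>\d{6})\] "
    r"\[(?P<level>\d+)\.(?P<verbosity>\d+)\] "
    r"\[pid=(?P<pid>\d+)\] "
    r"(?P<message>.*)$"
)


@dataclass
class DebugEntry:
    when: datetime
    level: int
    verbosity: int
    pid: int
    message: str


def parse_debug_line(line: str) -> DebugEntry:
    """Parse one nagios.debug line; raise ValueError if it does not match."""
    match = DEBUG_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised debug line: {line!r}")
    # Build the timestamp from integer parts to avoid float rounding.
    when = datetime.fromtimestamp(int(match["secs"]), tz=timezone.utc).replace(
        microsecond=int(match["usecs"])
    )
    return DebugEntry(
        when=when,
        level=int(match["level"]),
        verbosity=int(match["verbosity"]),
        pid=int(match["pid"]),
        message=match["message"],
    )


if __name__ == "__main__":
    sample = (
        "[1426798329.049619] [032.1] [pid=19507] Notifications are temporarily "
        "disabled for this service, so we won't send one out."
    )
    entry = parse_debug_line(sample)
    print(entry.when.isoformat(), entry.level, entry.message)
```

Converting the epoch timestamps to readable times makes it easier to line the messages up with when a check or passive result was actually submitted.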
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Notification e-mail not sending to all contactgroup memb

Post by tgriep »

I found this in the manual under Notifications:

Format: enable_notifications=<0/1>
Example: enable_notifications=1

This option determines whether or not Nagios will send out notifications when it initially (re)starts. If this option is disabled, Nagios will not send out notifications for any host or service. Note: If you have state retention enabled, Nagios will ignore this setting when it (re)starts and use the last known setting for this option (as stored in the state retention file), unless you disable the use_retained_program_state option. If you want to change this option when state retention is active (and the use_retained_program_state is enabled), you'll have to use the appropriate external command or change it via the web interface. Values are as follows:

0 = Disable notifications
1 = Enable notifications (default)

Can you check your nagios.cfg file for the state retention settings?
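As a rough illustration of the "appropriate external command" mentioned in the manual text, the sketch below builds the command lines for re-enabling notifications at runtime. `ENABLE_NOTIFICATIONS` and `ENABLE_SVC_NOTIFICATIONS` are standard Nagios external commands; the command file path is only the usual default for a source install and is an assumption here (check `command_file` in nagios.cfg), and the helper names and the host/service values are placeholders.

```python
import time
from typing import Optional

# Assumed default for a source install; the real path is whatever
# command_file is set to in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_external_command(command: str, *args: str, now: Optional[int] = None) -> str:
    """Build one external command line: [timestamp] COMMAND;arg1;arg2"""
    timestamp = int(time.time()) if now is None else now
    fields = ";".join((command, *args))
    return f"[{timestamp}] {fields}\n"


def send_external_command(line: str, command_file: str = COMMAND_FILE) -> None:
    """Write the command to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


if __name__ == "__main__":
    # Program-wide switch, the runtime counterpart of enable_notifications.
    print(format_external_command("ENABLE_NOTIFICATIONS"), end="")
    # Per-service switch; "host1" and "CPU Usages" are placeholder names.
    print(format_external_command("ENABLE_SVC_NOTIFICATIONS", "host1", "CPU Usages"), end="")
```

The example only prints the lines; calling `send_external_command` would hand them to a running Nagios, which needs write access to the command file.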
Be sure to check out our Knowledgebase for helpful articles and solutions!
vauran
Posts: 12
Joined: Wed Mar 18, 2015 4:10 pm

Re: Notification e-mail not sending to all contactgroup memb

Post by vauran »

These are the options:

retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
use_retained_program_state=1
use_retained_program_state=1
(use_retained_program_state=1 is in there twice...)

I noticed that the following errors were happening on some servers that we have shut down:

[1426798329.049619] [032.1] [pid=19507] Notifications are temporarily disabled for this service, so we won't send one out.
[1426798329.049627] [032.0] [pid=19507] Notification viability test failed. No notification will be sent out.

So that may be the issue. I believe my next step will be to set up a simple SSH service check for one of our servers, make sure its contact group is set to our Email - IT Support Group, and then kill SSH (if it's not already off) on the server while watching the debug file. Do you have a simple SSH service definition I can put into services.cfg? If not, I'll Google one.

I've been working with Nagios for like 3 days and I'm already starting to feel tons more comfortable with it :). Thanks for all of your help so far.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Notification e-mail not sending to all contactgroup memb

Post by jdalrymple »

I use check_ntp for doing what you want to do - /etc/init.d/sshd stop tends to be more troublesome to recover from without walking to a datacenter or opening an iLO session, and generally ntp can be stopped for short periods of time on even a production server safely.

I wish I was as far along after using Nagios for 3 days as you are.
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Notification e-mail not sending to all contactgroup memb

Post by tgriep »

You don't have to create a simple service check if you don't want to.
If you want to test notifications, open the service you are having issues with in the web interface and submit a passive check result. Select Critical as the check result and you should get notified.
At the next active check the service should return to OK, and you should get a recovery notification.
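The same passive result can also be submitted from the command line through the external command interface. A minimal sketch, assuming the standard `PROCESS_SERVICE_CHECK_RESULT` external command; the function name and the host/service values are placeholders for illustration.

```python
import time
from typing import Optional

# Plugin return codes as Nagios interprets them.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def passive_service_result(
    host: str,
    service: str,
    return_code: int,
    output: str,
    now: Optional[int] = None,
) -> str:
    """Build a PROCESS_SERVICE_CHECK_RESULT line for the command file."""
    if return_code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError(f"invalid return code: {return_code}")
    timestamp = int(time.time()) if now is None else now
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


if __name__ == "__main__":
    # Placeholder host and service names; write the printed line to the
    # file named by command_file in nagios.cfg to actually submit it.
    print(passive_service_result("host1", "DISK2", CRITICAL, "test of notifications"), end="")
```

Each line written this way counts as a single check result, just like one submission from the web interface.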
vauran
Posts: 12
Joined: Wed Mar 18, 2015 4:10 pm

Re: Notification e-mail not sending to all contactgroup memb

Post by vauran »

Hey tgriep,

Forgot about that :). I've been using the command line this whole time and didn't even know I could do that through the GUI. I see where to do that, so I will try it tomorrow once I get into work. I'll let you guys know what I see in the nagios.debug file when I submit a passive Critical check.

Thanks again for the help so far.
vauran
Posts: 12
Joined: Wed Mar 18, 2015 4:10 pm

Re: Notification e-mail not sending to all contactgroup memb

Post by vauran »

Hey All,

Good news! I did a passive check for one of our services (chose Critical status) and everyone in the contact group got the e-mail. Now, before we celebrate, I tried this on two services on another host and didn't get anything (not even in nagios.debug). The services that aren't sending e-mails for a Critical passive check are as follows:
*EDIT* It's important to note that the service whose test notification worked was already in WARNING and at attempt 5/5. When I sent a Critical passive result, it sent an e-mail. *END EDIT*

CPU Usage check services.cfg
define service {
    host_name host1
    check_command Linux-nrpe_check_cpu
    max_check_attempts 10
    normal_check_interval 10
    retry_interval 5
    active_checks_enabled 1
    passive_checks_enabled 1
    check_period 24x7
    parallelize_check 1
    obsess_over_service 1
    check_freshness 1
    freshness_threshold 0
    event_handler_enabled 1
    flap_detection_enabled 1
    process_perf_data 1
    retain_status_information 1
    retain_nonstatus_information 1
    notification_interval 0
    notification_period 24x7
    notifications_enabled 1
    failure_prediction_enabled 1
    service_description CPU Usages
    display_name
    notification_options u,c,r,f
    stalking_options o,w,u,c
    servicegroups CPU Usage
}

DISK2 check services.cfg
define service {
    host_name host1
    check_command Linux-nrpe_check_disk2
    max_check_attempts 5
    normal_check_interval 60
    retry_interval 5
    active_checks_enabled 1
    passive_checks_enabled 1
    check_period 24x7
    parallelize_check 1
    obsess_over_service 1
    check_freshness 1
    freshness_threshold 0
    event_handler_enabled 1
    flap_detection_enabled 1
    process_perf_data 1
    retain_status_information 1
    retain_nonstatus_information 1
    notification_interval 0
    notification_period 24x7
    notifications_enabled 1
    failure_prediction_enabled 1
    service_description DISK2
    display_name
    notification_options u,c,r,f
    stalking_options o,w,u,c
    servicegroups DISKS
}

CPU Usage status.dat
servicestatus {
    host_name=host1
    service_description=CPU Usages
    modified_attributes=1
    check_command=Linux-nrpe_check_cpu
    check_period=24x7
    notification_period=24x7
    check_interval=10.000000
    retry_interval=5.000000
    event_handler=
    has_been_checked=1
    should_be_scheduled=1
    check_execution_time=5.067
    check_latency=0.250
    check_type=0
    current_state=0
    last_hard_state=0
    last_event_id=235503
    current_event_id=235505
    current_problem_id=0
    last_problem_id=88924
    current_attempt=1
    max_attempts=10
    state_type=1
    last_state_change=1426859177
    last_hard_state_change=1426150097
    last_time_ok=1426859268
    last_time_warning=1425536300
    last_time_unknown=0
    last_time_critical=1426858877
    plugin_output=OK: CPU is 78% idle
    long_plugin_output=
    performance_data=
    last_check=1426859268
    next_check=1426859868
    check_options=0
    current_notification_number=0
    current_notification_id=287449
    last_notification=0
    next_notification=0
    no_more_notifications=0
    notifications_enabled=1
    active_checks_enabled=1
    passive_checks_enabled=1
    event_handler_enabled=1
    problem_has_been_acknowledged=0
    acknowledgement_type=0
    flap_detection_enabled=1
    failure_prediction_enabled=1
    process_performance_data=1
    obsess_over_service=1
    last_update=1426859454
    is_flapping=0
    percent_state_change=0.00
    scheduled_downtime_depth=0
}


DISK2 status.dat
servicestatus {
    host_name=host1
    service_description=DISK2
    modified_attributes=1
    check_command=Linux-nrpe_check_disk2
    check_period=24x7
    notification_period=24x7
    check_interval=60.000000
    retry_interval=5.000000
    event_handler=
    has_been_checked=1
    should_be_scheduled=1
    check_execution_time=0.020
    check_latency=0.126
    check_type=0
    current_state=0
    last_hard_state=0
    last_event_id=235500
    current_event_id=235502
    current_problem_id=0
    last_problem_id=88922
    current_attempt=1
    max_attempts=5
    state_type=1
    last_state_change=1426858750
    last_hard_state_change=1381255611
    last_time_ok=1426858750
    last_time_warning=0
    last_time_unknown=0
    last_time_critical=1426858701
    plugin_output=DISK OK - free space: /boot 68 MB (73% inode=99%):
    long_plugin_output=
    performance_data=/boot=24MB;78;88;0;98
    last_check=1426858750
    next_check=1426862350
    check_options=0
    current_notification_number=0
    current_notification_id=285784
    last_notification=0
    next_notification=0
    no_more_notifications=0
    notifications_enabled=1
    active_checks_enabled=1
    passive_checks_enabled=1
    event_handler_enabled=1
    problem_has_been_acknowledged=0
    acknowledgement_type=0
    flap_detection_enabled=1
    failure_prediction_enabled=1
    process_performance_data=1
    obsess_over_service=1
    last_update=1426859454
    is_flapping=0
    percent_state_change=0.00
    scheduled_downtime_depth=0
}

CPU Usage objects.cache
define service {
    host_name host1
    service_description CPU Usages
    display_name
    check_period 24x7
    check_command Linux-nrpe_check_cpu
    contact_groups Pagers,Email - Group IT Support
    notification_period 24x7
    initial_state o
    check_interval 10.000000
    retry_interval 5.000000
    max_check_attempts 10
    is_volatile 0
    parallelize_check 1
    active_checks_enabled 1
    passive_checks_enabled 1
    obsess_over_service 1
    event_handler_enabled 1
    low_flap_threshold 0.000000
    high_flap_threshold 0.000000
    flap_detection_enabled 1
    flap_detection_options o,w,u,c
    freshness_threshold 0
    check_freshness 1
    notification_options u,c,r,f
    notifications_enabled 1
    notification_interval 0.000000
    first_notification_delay 0.000000
    stalking_options o,u,w,c
    process_perf_data 1
    failure_prediction_enabled 1
    retain_status_information 1
    retain_nonstatus_information 1
}

DISK2 objects.cache
define service {
    host_name host1
    service_description DISK2
    display_name
    check_period 24x7
    check_command Linux-nrpe_check_disk2
    contact_groups Pagers,Email - Group IT Support
    notification_period 24x7
    initial_state o
    check_interval 60.000000
    retry_interval 5.000000
    max_check_attempts 5
    is_volatile 0
    parallelize_check 1
    active_checks_enabled 1
    passive_checks_enabled 1
    obsess_over_service 1
    event_handler_enabled 1
    low_flap_threshold 0.000000
    high_flap_threshold 0.000000
    flap_detection_enabled 1
    flap_detection_options o,w,u,c
    freshness_threshold 0
    check_freshness 1
    notification_options u,c,r,f
    notifications_enabled 1
    notification_interval 0.000000
    first_notification_delay 0.000000
    stalking_options o,u,w,c
    process_perf_data 1
    failure_prediction_enabled 1
    retain_status_information 1
    retain_nonstatus_information 1
}

This information was gathered after doing the passive checks against both. Got any idea why this isn't working? I have a feeling it's the max_check_attempts being a high number. They were currently at 1/10 and 1/5 respectively when I did the passive check. Any input would be appreciated.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Notification e-mail not sending to all contactgroup memb

Post by jdalrymple »

vauran wrote: I have a feeling it's the max_check_attempts being a high number. They were currently at 1/10 and 1/5 respectively when I did the passive check. Any input would be appreciated.

Your feeling is right. A passive check result counts as only one check result. You'd have to send 10 or 5 consecutive non-OK results respectively before a notification would be sent out, and you would have to do it before a normal active check returned the state to OK.

Make sense? I suspect so since you already had it figured out :D Instances like this are why we simulate failures instead of just submitting passive check results.
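To make the counting concrete, here is a deliberately simplified model of the retry logic described above. It is a sketch, not Nagios's actual state machine: it only tracks consecutive non-OK results for a service that starts out OK, and it ignores cases such as a service that is already in a hard problem state and changes to another problem state. The function name is hypothetical.

```python
from typing import Iterable, Optional

OK = 0


def first_hard_problem(results: Iterable[int], max_check_attempts: int) -> Optional[int]:
    """Return the 1-based position of the result that makes the problem HARD.

    Simplified model: each consecutive non-OK result counts as one attempt,
    any OK result resets the count, and the problem becomes HARD (the point
    at which a notification can go out) once the count reaches
    max_check_attempts. Returns None if that never happens.
    """
    attempts = 0
    for position, state in enumerate(results, start=1):
        if state == OK:
            attempts = 0  # an active check returning OK starts over
            continue
        attempts += 1
        if attempts >= max_check_attempts:
            return position
    return None


if __name__ == "__main__":
    critical = 2
    # One passive CRITICAL against max_check_attempts=5: no hard state yet.
    print(first_hard_problem([critical], 5))
    # Five consecutive CRITICAL results: the fifth one triggers it.
    print(first_hard_problem([critical] * 5, 5))
```

Under this model a single passive Critical result against a service at 1/5 or 1/10 never reaches a hard state, which matches what was seen in the test.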
vauran
Posts: 12
Joined: Wed Mar 18, 2015 4:10 pm

Re: Notification e-mail not sending to all contactgroup memb

Post by vauran »

Awesome, thank you. I've learned a lot from this so I think I can continue troubleshooting this on my own.

Just an overview for anyone who googles the issue I had:

Initial issue: E-mail notifications from Nagios were not sending to everyone in the contact group.

To troubleshoot this I added the following lines in /usr/local/nagios/etc/nagios.cfg while following the (great) documentation at http://nagios.sourceforge.net/docs/3_0/configmain.html:

debug_file=/usr/local/nagios/var/nagios.debug
debug_verbosity=2
max_debug_file_size=1000000
debug_level=32
***debug_level=32 is specific for notification debugging.***
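debug_level is a bitmask, so several areas can be combined by adding their values. A small sketch of how a value decodes, using the area names as I recall them from the Nagios 3 main configuration documentation (newer releases define additional bits that are not listed here); the function and dictionary names are made up for illustration.

```python
from typing import List

# Bits as listed in the Nagios 3 main configuration docs (recalled from
# memory, so verify against your version's documentation).
DEBUG_AREAS = {
    1: "function enter/exit",
    2: "config",
    4: "process",
    8: "scheduled events",
    16: "host/service checks",
    32: "notifications",
    64: "event broker",
}


def decode_debug_level(level: int) -> List[str]:
    """Return the names of the debug areas enabled by a debug_level value."""
    if level == -1:
        # -1 means "log everything".
        return [name for _, name in sorted(DEBUG_AREAS.items())]
    return [name for bit, name in sorted(DEBUG_AREAS.items()) if level & bit]


if __name__ == "__main__":
    print(decode_debug_level(32))       # notifications only
    print(decode_debug_level(16 + 32))  # checks plus notifications
```

Setting debug_level=48, for example, would log check processing alongside the notification logic, which can help when a notification depends on how a check result was handled.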

While running "tail -f /usr/local/nagios/var/nagios.debug", I used the web GUI to submit a passive check result against a service that was already at attempt 5/5, choosing the Critical option. This sent the notification e-mail to the entire contact group defined in /usr/local/nagios/etc/objects/contactgroups.cfg (contacts are in /usr/local/nagios/etc/objects/contacts.cfg). I then submitted passive check results against two services that were only at attempt 1/5 and 1/10, and nothing was sent, since Nagios only notifies once a service has returned a non-OK result for max_check_attempts consecutive checks (5 and 10 here), and only for the states listed in that service's notification_options in services.cfg. Looking at my services, a lot have the following:

notification_options u,c,r,f
stalking_options w,u,c

Which means that it won't alert on warning, but will on unknown, recovery, flapping, and critical.
I will be changing this to add warning to notification_options, since, as far as I can tell, the 'w' flag in stalking_options does not affect notifications.
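A quick way to double-check what a given notification_options value covers is to expand the flags. This is only a sketch: the flag meanings are the standard ones for service definitions, while the function and dictionary names are hypothetical. stalking_options is a separate setting that controls logging of changed check output and is not consulted here.

```python
from typing import List

# Standard flags for notification_options in a service definition.
SERVICE_NOTIFICATION_FLAGS = {
    "w": "warning",
    "u": "unknown",
    "c": "critical",
    "r": "recovery",
    "f": "flapping start/stop",
    "s": "scheduled downtime start/end",
}


def notified_events(options: str) -> List[str]:
    """Expand a notification_options value into the events it covers."""
    flags = [flag.strip() for flag in options.split(",") if flag.strip()]
    if "n" in flags:
        # "n" (none) means no service notifications are sent at all.
        return []
    unknown = [flag for flag in flags if flag not in SERVICE_NOTIFICATION_FLAGS]
    if unknown:
        raise ValueError(f"unknown notification flags: {unknown}")
    return [SERVICE_NOTIFICATION_FLAGS[flag] for flag in flags]


if __name__ == "__main__":
    print(notified_events("u,c,r,f"))    # current setting, no warning
    print(notified_events("w,u,c,r,f"))  # with warning added
```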

Once again, thanks for the support, guys. I believe I know enough now to troubleshoot specific service/host alerts not being sent out. Through my passive check tests, I proved that e-mails were working for the entire group and that only specific situations aren't working. If I need more help, I'll make sure to let you guys know.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Notification e-mail not sending to all contactgroup memb

Post by jdalrymple »

Thanks for the wrap-up vauran. I'm going to lock this topic and mark it solved. Feel free to return for any other help!
Locked