Notifications not being sent to all members of Contact Group

Posted: Wed Jun 25, 2014 3:07 pm
by crnelson
System Info
Linux Distribution and version?
Distributor ID: CentOS
Release: 5.10
Linux 2.6.18-371.4.1.el5PAE #1 SMP Thu Jan 30 06:51:58 EST 2014 i686 i686 i386 GNU/Linux

32 or 64bit?
32 bit

VMware Image or Manual Install of XI?
VMware Image

Are there special configurations on your system, e.g., is Gnome installed? Are you using a proxy? Are you using SSL?
Yes, using SSL.


ISSUE
A monitored server had a HDD go critical this morning and notifications were sent; however, they were not sent to all contacts.
Facts:
- This particular monitor (Drive G) is applied to a Host Group.
- This monitor is set up to notify a Contact Group.
- All members of said Contact Group were notified, except the one for PagerDuty. This was viewable under Notifications.

Testing:
- Triggering other Drive G notifications for hosts in the same Host Group notifies ALL members of said Contact Group (verified in Notifications).
- Removed the host completely and recreated it.
- Triggered another notification for the new host, but no entry was shown under Notifications for the PagerDuty contact. All other contacts were shown in Notifications.
- Validated once more that the PagerDuty contact was a part of the applicable Contact Group.

Re: Notifications not being sent to all members of Contact G

Posted: Wed Jun 25, 2014 3:10 pm
by slansing
Can you share the configurations for your host, an associated service you were expecting alerts for, and the contacts which did not receive notifications? Please blank out their addresses.

You can view your configurations by accessing the CCM and clicking the blue diskette icon next to the configuration's table in the list.

Re: Notifications not being sent to all members of Contact G

Posted: Wed Jun 25, 2014 3:37 pm
by crnelson
Selection_001.png
Selection_002.png
Selection_003.png

Re: Notifications not being sent to all members of Contact G

Posted: Wed Jun 25, 2014 3:39 pm
by crnelson
Selection_004.png
Selection_005.png
Selection_006.png

Re: Notifications not being sent to all members of Contact G

Posted: Wed Jun 25, 2014 3:40 pm
by crnelson
Selection_007.png
As previously mentioned, the contact and contact group work with other hosts, just not "this" one.

Re: Notifications not being sent to all members of Contact G

Posted: Wed Jun 25, 2014 8:34 pm
by Box293
Could you also check whether there were any entries in /var/log/messages from when the problem occurred (both the original problem and the tests)? Please paste them here in a code block.

Re: Notifications not being sent to all members of Contact G

Posted: Thu Jun 26, 2014 4:29 pm
by crnelson
Sorry for the slow response. Here's an excerpt from the log during the original failure that prompted my investigation.

Code:

Jun 25 05:21:01 fwapp003 xinetd[3580]: START: nrpe pid=443 from=*.*.*.24
Jun 25 05:21:01 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=443 duration=0(sec)
Jun 25 05:21:02 fwapp003 xinetd[3580]: START: nrpe pid=543 from=*.*.*.24
Jun 25 05:21:02 fwapp003 xinetd[3580]: START: nrpe pid=544 from=*.*.*.24
Jun 25 05:21:02 fwapp003 xinetd[3580]: START: nrpe pid=545 from=*.*.*.24
Jun 25 05:21:02 fwapp003 xinetd[3580]: START: nrpe pid=546 from=*.*.*.24
Jun 25 05:21:02 fwapp003 xinetd[3580]: START: nrpe pid=547 from=*.*.*.24
Jun 25 05:21:02 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=547 duration=0(sec)
Jun 25 05:21:02 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=545 duration=0(sec)
Jun 25 05:21:02 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=546 duration=0(sec)
Jun 25 05:21:02 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=543 duration=0(sec)
Jun 25 05:21:02 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=544 duration=0(sec)
Jun 25 05:21:03 fwapp003 nagios: SERVICE ALERT: *201;Apache Web Server;CRITICAL;SOFT;2;[25.06.2014 05:21:00 SYSTEM] watchdog for monitor is not running. 
Jun 25 05:21:03 fwapp003 ndo2db: Error: mysql_query() failed for 'INSERT INTO nagios_statehistory SET instance_id='1', state_time=FROM_UNIXTIME(1403698863), state_time_usec='284697', object_id='433', state_change='1', state='2', state_type='0', current_check_attempt='2', max_check_attempts='5', last_state='2', last_hard_state='0', output='\[25\.06\.2014 05:21:00 SYSTEM\] watchdog for monitor is not running\.', long_output='\[25\.06\.2014 05:21:00 SYSTEM\] watchdog for monitor is not running\.'' 
Jun 25 05:21:03 fwapp003 ndo2db: mysql_error: 'Table './nagios/nagios_statehistory' is marked as crashed and last (automatic?) repair failed' 
Jun 25 05:21:12 fwapp003 nagios: SERVICE NOTIFICATION: stb*;*008;Drive_G: Disk Usage;CRITICAL;notify-service-by-email;G:\ - total: 75.00 Gb - used: 73.77 Gb (98%) - free 1.23 Gb (2%) 
Jun 25 05:21:12 fwapp003 nagios: SERVICE NOTIFICATION: rmc*;*008;Drive_G: Disk Usage;CRITICAL;notify-service-by-email;G:\ - total: 75.00 Gb - used: 73.77 Gb (98%) - free 1.23 Gb (2%) 
Jun 25 05:21:12 fwapp003 nagios: SERVICE NOTIFICATION: mpr*;*008;Drive_G: Disk Usage;CRITICAL;notify-service-by-email;G:\ - total: 75.00 Gb - used: 73.77 Gb (98%) - free 1.23 Gb (2%) 
Jun 25 05:21:12 fwapp003 nagios: SERVICE NOTIFICATION: eca*;*008;Drive_G: Disk Usage;CRITICAL;notify-service-by-email;G:\ - total: 75.00 Gb - used: 73.77 Gb (98%) - free 1.23 Gb (2%) 
Jun 25 05:21:12 fwapp003 xinetd[3580]: START: nrpe pid=1285 from=*.*.*.10
Jun 25 05:21:12 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=1285 duration=0(sec)
Jun 25 05:21:22 fwapp003 xinetd[3580]: START: nrpe pid=1578 from=*.*.*.24
Jun 25 05:21:22 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=1578 duration=0(sec)
Jun 25 05:21:44 fwapp003 xinetd[3580]: START: nrpe pid=2631 from=*.*.*.10
Jun 25 05:21:44 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=2631 duration=0(sec)
Jun 25 05:21:59 fwapp003 xinetd[3580]: START: nrpe pid=3126 from=*.*.*.24
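
For anyone comparing excerpts like the one above against a contact group, here is a minimal Python sketch that pulls the contact names out of the "SERVICE NOTIFICATION" lines and reports which expected contacts never appear. The field order (contact;host;service;state;command;output) is read off the log lines in this thread. The function names and the expected-contacts list are hypothetical: the real group membership is only visible in the screenshots, so the list below is an assumption built from the contact names that show up in the logs.

```python
import re
from collections import defaultdict

# Field order taken from the log lines in this thread:
# contact;host;service;state;command;output
NOTIFY_RE = re.compile(
    r"nagios: SERVICE NOTIFICATION: "
    r"(?P<contact>[^;]*);(?P<host>[^;]*);(?P<service>[^;]*);"
    r"(?P<state>[^;]*);(?P<command>[^;]*);"
)

def contacts_notified(lines):
    """Map (host, service) to the set of contacts seen in notification lines."""
    seen = defaultdict(set)
    for line in lines:
        match = NOTIFY_RE.search(line)
        if match:
            seen[(match["host"], match["service"])].add(match["contact"])
    return dict(seen)

def missing_contacts(lines, host, service, expected):
    """Return the expected contacts with no notification line for host/service."""
    notified = contacts_notified(lines).get((host, service), set())
    return sorted(set(expected) - notified)

# Notification lines from the 05:21:12 excerpt (masked names kept as posted).
SAMPLE = [
    r"Jun 25 05:21:12 fwapp003 nagios: SERVICE NOTIFICATION: stb*;*008;Drive_G: Disk Usage;CRITICAL;notify-service-by-email;G:\ - total: 75.00 Gb - used: 73.77 Gb (98%) - free 1.23 Gb (2%)",
    r"Jun 25 05:21:12 fwapp003 nagios: SERVICE NOTIFICATION: rmc*;*008;Drive_G: Disk Usage;CRITICAL;notify-service-by-email;G:\ - total: 75.00 Gb - used: 73.77 Gb (98%) - free 1.23 Gb (2%)",
    r"Jun 25 05:21:12 fwapp003 nagios: SERVICE NOTIFICATION: mpr*;*008;Drive_G: Disk Usage;CRITICAL;notify-service-by-email;G:\ - total: 75.00 Gb - used: 73.77 Gb (98%) - free 1.23 Gb (2%)",
    r"Jun 25 05:21:12 fwapp003 nagios: SERVICE NOTIFICATION: eca*;*008;Drive_G: Disk Usage;CRITICAL;notify-service-by-email;G:\ - total: 75.00 Gb - used: 73.77 Gb (98%) - free 1.23 Gb (2%)",
    "Jun 25 05:21:12 fwapp003 xinetd[3580]: START: nrpe pid=1285 from=*.*.*.10",
]

# Assumed membership of the contact group (not confirmed by the thread text).
EXPECTED = ["stb*", "rmc*", "mpr*", "eca*", "pagerduty_MSSQL_PROD"]

if __name__ == "__main__":
    print(missing_contacts(SAMPLE, "*008", "Drive_G: Disk Usage", EXPECTED))
```

To run it against a real log, replace SAMPLE with the lines read from /var/log/messages and EXPECTED with the actual members of the contact group.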

Re: Notifications not being sent to all members of Contact G

Posted: Thu Jun 26, 2014 4:39 pm
by crnelson
Here's an excerpt from a test against another host

The following line is expected because I sent an OK notification. What I wanted to see was the contact trigger, which it did.
"Jun 25 12:15:34 fwapp003 pagerduty_nagios[626]: Nagios event in file /tmp/pagerduty_nagios/pd_1403723733_626.txt REJECTED by the PagerDuty server. Server says: The NOTIFICATIONTYPE field must be present and must be one of: PROBLEM, ACKNOWLEDGEMENT, RECOVERY, NOP. "

Full log with abstracted hostnames, user names, and IPs

Code:

Jun 25 12:15:32 fwapp003 nagios: SERVICE NOTIFICATION: stb*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-wv_alerts-with-additional-addresses;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:32 fwapp003 nagios: SERVICE NOTIFICATION: stb*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-email;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:32 fwapp003 nagios: SERVICE NOTIFICATION: stb*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-email;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:32 fwapp003 nagios: SERVICE NOTIFICATION: st*;*011;Drive_G: Disk Usage;CUSTOM (OK);xi_service_notification_handler;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:33 fwapp003 nagios: SERVICE NOTIFICATION: rmc*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-wv_alerts-with-additional-addresses;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:33 fwapp003 nagios: SERVICE NOTIFICATION: rmc*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-email;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:33 fwapp003 nagios: SERVICE NOTIFICATION: rmc*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-email;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:33 fwapp003 nagios: SERVICE NOTIFICATION: rmc*;*011;Drive_G: Disk Usage;CUSTOM (OK);xi_service_notification_handler;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:33 fwapp003 nagios: SERVICE NOTIFICATION: pagerduty_MSSQL_PROD;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-pagerduty;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:34 fwapp003 pagerduty_nagios[626]: Nagios event in file /tmp/pagerduty_nagios/pd_1403723733_626.txt REJECTED by the PagerDuty server.  Server says: The NOTIFICATIONTYPE field must be present and must be one of: PROBLEM, ACKNOWLEDGEMENT, RECOVERY, NOP. 
Jun 25 12:15:34 fwapp003 nagios: SERVICE NOTIFICATION: mpr*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-wv_alerts-with-additional-addresses;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:34 fwapp003 nagios: SERVICE NOTIFICATION: mpr*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-email;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:34 fwapp003 nagios: SERVICE NOTIFICATION: mpr*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-email;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:34 fwapp003 nagios: SERVICE NOTIFICATION: mpr*;*011;Drive_G: Disk Usage;CUSTOM (OK);xi_service_notification_handler;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:34 fwapp003 nagios: SERVICE NOTIFICATION: eca*;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-email;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY 
Jun 25 12:15:35 fwapp003 xinetd[3580]: START: nrpe pid=704 from=*.*.*.10
Jun 25 12:15:35 fwapp003 xinetd[3580]: EXIT: nrpe status=0 pid=704 duration=0(sec)

Re: Notifications not being sent to all members of Contact G

Posted: Thu Jun 26, 2014 6:40 pm
by Box293

Code:

Jun 25 12:15:33 fwapp003 nagios: SERVICE NOTIFICATION: pagerduty_MSSQL_PROD;*011;Drive_G: Disk Usage;CUSTOM (OK);notify-service-by-pagerduty;G:\ - total: 250.00 Gb - used: 19.69 Gb (8%) - free 230.31 Gb (92%);crn*;TEST NOTIFY

Jun 25 12:15:34 fwapp003 pagerduty_nagios[626]: Nagios event in file /tmp/pagerduty_nagios/pd_1403723733_626.txt REJECTED by the PagerDuty server.  Server says: The NOTIFICATIONTYPE field must be present and must be one of: PROBLEM, ACKNOWLEDGEMENT, RECOVERY, NOP. 
That's what we are looking for. It looks like the PagerDuty server is rejecting the message being sent to it.

Check the command notify-service-by-pagerduty and ensure that it is correctly submitting the NOTIFICATIONTYPE field as one of PROBLEM, ACKNOWLEDGEMENT, RECOVERY, NOP.
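
As a rough illustration of that check, the following Python sketch tests a notification type against the four values quoted in the server's rejection message. The accepted set is copied from that message only. The helper names are hypothetical, and the assumption that a log field such as "CUSTOM (OK)" carries the notification type before the parenthesis is inferred from the log lines in this thread.

```python
# Values copied from the PagerDuty rejection message quoted in this thread.
ACCEPTED_TYPES = frozenset({"PROBLEM", "ACKNOWLEDGEMENT", "RECOVERY", "NOP"})

def type_from_log_field(field):
    """Extract the leading token from a log field like 'CUSTOM (OK)'.

    Assumption: the text before the parenthesis is the notification type.
    Fields without a parenthesis are returned unchanged (stripped).
    """
    return field.split("(", 1)[0].strip()

def would_be_accepted(notification_type):
    """True if the type is one of the values the server message lists."""
    return notification_type.strip().upper() in ACCEPTED_TYPES

if __name__ == "__main__":
    for field in ("CUSTOM (OK)", "PROBLEM", "RECOVERY"):
        kind = type_from_log_field(field)
        print(field, "->", kind, "accepted" if would_be_accepted(kind) else "rejected")
```

Under these assumptions, the forced test notification logged as "CUSTOM (OK)" would be rejected, while a notification of type PROBLEM would pass this particular check.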

Re: Notifications not being sent to all members of Contact G

Posted: Sun Jun 29, 2014 11:39 pm
by crnelson

Code:

Server says: The NOTIFICATIONTYPE field must be present and must be one of: PROBLEM, ACKNOWLEDGEMENT, RECOVERY, NOP.
That rejection is due to the test being a forced notification with a state of OK, which doesn't match any of the values PagerDuty is expecting.
The following line is expected because I sent an OK notification. What I wanted to see was the contact trigger, which it did.
"Jun 25 12:15:34 fwapp003 pagerduty_nagios[626]: Nagios event in file /tmp/pagerduty_nagios/pd_1403723733_626.txt REJECTED by the PagerDuty server. Server says: The NOTIFICATIONTYPE field must be present and must be one of: PROBLEM, ACKNOWLEDGEMENT, RECOVERY, NOP. "
Do you see in the log that the notification was NOT sent to PagerDuty when the service went critical? See my post from Thu Jun 26, 2014 4:29 pm. This is what I'm trying to get to the bottom of.