I think I may have found a bug in Nagios (this is version 3.4.1, however). When a host goes down, the notification command runs but times out, yet the host UP notifications always work, so I can't see it being a sendmail problem. I've tried many things and still can't find the cause; increasing the timeout to 600 doesn't help. Running the command from the Linux command line works perfectly. Here's what the logs show:
HOST NOTIFICATION: nagiosadmin;Router;DOWN;notify-host-by-email;(Host Check Timed Out)
[1388111250] Warning: Contact 'nagiosadmin' host notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: Router\nState: DOWN\nAddress: 192.168.2.10\nInfo: (Host Check Timed Out)\n\nDate/Time: Thu Dec 26 21:26:29 EST 2013\n" | /bin/mail -s "** PROBLEM Host Alert: Router is DOWN **" [email protected]' timed out after 60 seconds
And if I set host_notification_options = n then it just fails on the service_notification instead:
SERVICE NOTIFICATION: nagiosadmin;hp1810-SW;Port 1 Link Status;CRITICAL;notify-service-by-email;SNMP CRITICAL - *down(2)*
[1388120076] Warning: Contact 'nagiosadmin' service notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService: Port 1 Link Status\nHost: hp1810-8g\nAddress: 192.168.10.2\nState: CRITICAL\n\nDate/Time: Thu Dec 26 23:53:35 EST 2013\n\nAdditional Info:\n\nSNMP CRITICAL - *down(2)*\n" | /bin/mail -s "** PROBLEM Service Alert: hp1810-8g/Port 1 Link Status is CRITICAL **" [email protected]' timed out after 60 seconds
As soon as connectivity is restored, all the recovery emails come in.
In Templates.cfg:
service_notification_options w,u,c,r,f,s
host_notification_options d,u,r,f,s
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
nagios bug
Re: nagios bug
Have you filed a bug report for this yet? It certainly does not seem like expected behavior. Can you expand on "Running the command from linux works perfectly"? Do you mean running the whole /bin/mail command works?
Former Nagios employee
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: nagios bug
Can you show us the output from your maillog of the mail actually timing out?
Re: nagios bug
I haven't filed a bug report yet, since I wanted to make sure it was really a bug and to see whether anyone else has the same problem. Here is the output from maillog, and what happens when I run the whole command from Linux:
sendmail[17867]: rBRKZGBl017867: from=nagios, size=0, class=0, nrcpts=0, relay=nagios@localhost
And here is when the recovery emails come in:
sendmail[18190]: rBRKeS6Y018190: from=nagios, size=430, class=0, nrcpts=1, msgid=<[email protected]>, relay=nagios@localhost
solarflow sendmail[18191]: rBRKeSVd018191: from=<[email protected]>, size=673, class=0, nrcpts=1, msgid=<[email protected]>, proto=ESMTP, daemon=MTA, relay=localhost [127.0.0.1]
solarflow sendmail[18190]: rBRKeS6Y018190: to=[email protected], ctladdr=nagios (496/496), delay=00:00:00, xdelay=00:00:00, mailer=relay, pri=30430, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (rBRKeSVd018191 Message accepted for delivery)
And here is when I run it from the command line:
$ /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: Router\nState: DOWN\nAddress: 192.168.2.10\nInfo: (Host Check Timed Out)\n\nDate/Time: Fri Dec 27 15:35:16 EST 2013\n" | /bin/mail -s "** PROBLEM Host Alert: Router is DOWN **" [email protected]
$ mail
Heirloom Mail version 12.4 7/29/08. Type ? for help.
"/var/spool/mail/root": 2 messages 1 new
1 Mail System Internal Fri Dec 27 15:45 13/544 "DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA"
>N 2 root Fri Dec 27 15:45 28/931 "** PROBLEM Host Alert: Router is DOWN **"
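In the maillog excerpts above, the attempt that appears to be the timed-out notification logs size=0 and nrcpts=0, while the recovery emails log a real size and one recipient. A minimal sketch for picking such entries out of a larger log, assuming the sendmail line format shown above; the function name empty_submissions and the /var/log/maillog path in the usage comment are assumptions, not something taken from this system:

```shell
# Hypothetical helper: print sendmail log lines that recorded an empty
# message with no recipients (size=0 and nrcpts=0), reading log text on stdin.
empty_submissions() {
    grep -E 'sendmail\[[0-9]+\]: .*size=0, .*nrcpts=0,'
}

# Example usage (log path is an assumption, adjust for your system):
#   empty_submissions < /var/log/maillog
```

Comparing the timestamps of any matches with the "timed out after 60 seconds" warnings in the Nagios log would show whether they line up with the failed notifications.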
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: nagios bug
In your first message it looks like you have a single ' after the email address followed by some other info. This doesn't seem to be properly formatted. Can you post your notify-host-by-email command?
HOST NOTIFICATION: nagiosadmin;Router;DOWN;notify-host-by-email;(Host Check Timed Out)
[1388111250] Warning: Contact 'nagiosadmin' host notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: Router\nState: DOWN\nAddress: 192.168.2.10\nInfo: (Host Check Timed Out)\n\nDate/Time: Thu Dec 26 21:26:29 EST 2013\n" | /bin/mail -s "** PROBLEM Host Alert: Router is DOWN **" [email protected]' timed out after 60 seconds
Re: nagios bug
Just to provide some closure to this issue: the problem seems to stem from DNS not being available. Even with host entries in /etc/hosts, sendmail still queries DNS when delivering messages locally; if it can't reach a DNS server, nothing goes into the mailq and the delivery silently fails. In my tests sendmail would not deliver unless it got an NXDOMAIN response. There's probably a configuration option to change this, but postfix seemed to handle it better, and it's listed to replace sendmail as the default MTA in RHEL and Fedora anyway.
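For anyone hitting the same symptom, here is a rough sketch of how one might check whether a name lookup answers, fails quickly, or hangs, which is the distinction that mattered here. It assumes GNU coreutils timeout (which exits with status 124 when the limit is hit); the function names classify_rc and probe are made up for illustration, and the lookup command you pass in is up to you:

```shell
# Map the exit status of a `timeout`-wrapped command to a short label.
# 124 is the status GNU timeout returns when the time limit is reached.
classify_rc() {
    case $1 in
        0)   echo "answered" ;;
        124) echo "timed-out" ;;
        *)   echo "failed" ;;
    esac
}

# Run a command under a time limit and label the outcome.
# Usage: probe SECONDS COMMAND [ARGS...]
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    classify_rc $?
}
```

For example, `probe 5 getent hosts localhost` exercises the system resolver. Note that getent follows nsswitch.conf and will normally consult /etc/hosts first, so to test DNS itself you would pass a DNS-only tool such as host or dig instead. A "timed-out" result while the link is down would be consistent with the behaviour described above, whereas "failed" corresponds to a quick negative answer.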
Thanks for the help ...
Re: nagios bug
Thanks for getting back to us! Glad to see you got it working. Yea, postfix seems to be the preferred MTA these days, so I'm not surprised. Good to see some empirical evidence though.
I'm going to lock this up now, but feel free to open another if you have more questions.
Former Nagios employee