nagios bug

solarflow
Posts: 3
Joined: Thu Dec 26, 2013 7:40 pm

nagios bug

Post by solarflow »

I think I may have found a bug in Nagios, although this is on version 3.4.1. When a host goes down, Nagios tries to send the notification but the command times out. Host UP notifications always work, so I can't see it being a sendmail problem. I've tried many things and still can't find the cause; increasing the timeout to 600 doesn't help. Running the command from the Linux shell works perfectly. Here's what the logs show:

HOST NOTIFICATION: nagiosadmin;Router;DOWN;notify-host-by-email;(Host Check Timed Out)
[1388111250] Warning: Contact 'nagiosadmin' host notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: Router\nState: DOWN\nAddress: 192.168.2.10\nInfo: (Host Check Timed Out)\n\nDate/Time: Thu Dec 26 21:26:29 EST 2013\n" | /bin/mail -s "** PROBLEM Host Alert: Router is DOWN **" [email protected]' timed out after 60 seconds


And if I set host_notification_options = n then it just fails on the service_notification instead:

SERVICE NOTIFICATION: nagiosadmin;hp1810-SW;Port 1 Link Status;CRITICAL;notify-service-by-email;SNMP CRITICAL - *down(2)*
[1388120076] Warning: Contact 'nagiosadmin' service notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\n\nService: Port 1 Link Status\nHost: hp1810-8g\nAddress: 192.168.10.2\nState: CRITICAL\n\nDate/Time: Thu Dec 26 23:53:35 EST 2013\n\nAdditional Info:\n\nSNMP CRITICAL - *down(2)*\n" | /bin/mail -s "** PROBLEM Service Alert: hp1810-8g/Port 1 Link Status is CRITICAL **" [email protected]' timed out after 60 seconds


As soon as connectivity is restored, all the recovery emails come in.

in Templates.cfg:

service_notification_options w,u,c,r,f,s
host_notification_options d,u,r,f,s
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: nagios bug

Post by tmcdonald »

Have you filed a bug report for this yet? It certainly does not seem like expected behavior. Can you expand on "Running the command from linux works perfectly"? Do you mean running the whole /bin/mail command works?
Former Nagios employee
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: nagios bug

Post by slansing »

Can you show us the output from your maillog of the mail actually timing out?
solarflow
Posts: 3
Joined: Thu Dec 26, 2013 7:40 pm

Re: nagios bug

Post by solarflow »

I haven't filed a bug report yet, since I wanted to make sure it was really a bug and to see whether anyone else has the same problem. Here is the output from maillog, and what happens when I run the whole command from Linux:

sendmail[17867]: rBRKZGBl017867: from=nagios, size=0, class=0, nrcpts=0, relay=nagios@localhost


And here is when the recovery emails come in:

sendmail[18190]: rBRKeS6Y018190: from=nagios, size=430, class=0, nrcpts=1, msgid=<[email protected]>, relay=nagios@localhost
solarflow sendmail[18191]: rBRKeSVd018191: from=<[email protected]>, size=673, class=0, nrcpts=1, msgid=<[email protected]>, proto=ESMTP, daemon=MTA, relay=localhost [127.0.0.1]
solarflow sendmail[18190]: rBRKeS6Y018190: to=[email protected], ctladdr=nagios (496/496), delay=00:00:00, xdelay=00:00:00, mailer=relay, pri=30430, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (rBRKeSVd018191 Message accepted for delivery)


And here is when I run it from the command line:

$ /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: Router\nState: DOWN\nAddress: 192.168.2.10\nInfo: (Host Check Timed Out)\n\nDate/Time: Fri Dec 27 15:35:16 EST 2013\n" | /bin/mail -s "** PROBLEM Host Alert: Router is DOWN **" [email protected]

$ mail
Heirloom Mail version 12.4 7/29/08. Type ? for help.
"/var/spool/mail/root": 2 messages 1 new
1 Mail System Internal Fri Dec 27 15:45 13/544 "DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA"
>N 2 root Fri Dec 27 15:45 28/931 "** PROBLEM Host Alert: Router is DOWN **"
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: nagios bug

Post by scottwilkerson »

In your first message it looks like you have a single ' after the email address, followed by some other info. This doesn't seem to be properly formatted. Can you post your notify-host-by-email command?
HOST NOTIFICATION: nagiosadmin;Router;DOWN;notify-host-by-email;(Host Check Timed Out)
[1388111250] Warning: Contact 'nagiosadmin' host notification command '/usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nHost: Router\nState: DOWN\nAddress: 192.168.2.10\nInfo: (Host Check Timed Out)\n\nDate/Time: Thu Dec 26 21:26:29 EST 2013\n" | /bin/mail -s "** PROBLEM Host Alert: Router is DOWN **" [email protected]' timed out after 60 seconds
Former Nagios employee
solarflow
Posts: 3
Joined: Thu Dec 26, 2013 7:40 pm

Re: nagios bug

Post by solarflow »

Just to provide some closure to this issue: the problem seems to stem from DNS not being available. Something about the way sendmail delivers messages locally means that, even with host entries in /etc/hosts, it still queries DNS anyway; if it can't reach a DNS server, nothing goes into the mailq and delivery silently fails. In my tests sendmail would not deliver unless it got an NXDOMAIN response. There's probably a configuration option to change this, but postfix seemed to handle it better, and it's slated to replace sendmail as the default MTA in RHEL and Fedora anyway.
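
For anyone who hits the same symptom, here is a rough sketch of how one might check whether name resolution is the step that stalls. It distinguishes a quick answer (including a quick NXDOMAIN-style failure) from a lookup that hangs. The function name `timed_lookup`, its `resolver` parameter, and the 5-second threshold are made up for illustration; they are not part of Nagios or sendmail.

```python
import socket
import time


def timed_lookup(name, resolver=socket.getaddrinfo, slow_after=5.0):
    """Resolve `name` and report how long the lookup took.

    Returns (status, elapsed_seconds), where status is:
      "ok"     - the resolver returned an answer
      "failed" - the resolver raised an error (e.g. the name does not exist)
      "slow"   - the lookup, successful or not, took longer than `slow_after`
    """
    start = time.monotonic()
    try:
        resolver(name, None)
        status = "ok"
    except OSError:  # socket.gaierror is a subclass of OSError
        status = "failed"
    elapsed = time.monotonic() - start
    # A hung resolver is the interesting case: flag it regardless of outcome.
    if elapsed > slow_after:
        status = "slow"
    return status, elapsed


# Example (performs a real lookup, so the result depends on the machine):
# print(timed_lookup("localhost"))
```

If lookups come back "slow" only while the link to the DNS server is down, that would be consistent with the mail command blocking until Nagios kills it at the notification timeout.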

Thanks for the help ...
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: nagios bug

Post by tmcdonald »

Thanks for getting back to us! Glad to see you got it working. Yea, postfix seems to be the preferred MTA these days, so I'm not surprised. Good to see some empirical evidence though.

I'm going to lock this up now, but feel free to open another if you have more questions.
Former Nagios employee