Isn't this what the 'Send custom service notification' and 'Send custom host notification' commands are for?
I may be missing something, but it sounds like the mail server is being used outside of nagios when everything that's needed is inside.
As Andy said, nagios should be sending out an alert to all the valid people for a service/host when that check goes critical; it also sends out a notification when a host/service is acknowledged in nagios. When the service or host recovers, there is a third notification that goes out saying all is better.
If you aren't getting those 3 notifications, then it sounds like the contact or the service/host definition is set up not to send them.
Check the notification_options for the services and hosts: your hosts should generally have at least d,u,r (down, unreachable, recovery) without an 'n' option (that'd be none); services should have at least c,u,r (critical, unknown, recovery) or w,c,u,r (notify for warnings too). Again, no 'n'.
You can see who has been notified from within nagios: it's on the lower left part of the nav bar in the Reports section, titled 'Notifications'. But you shouldn't really need to look there; the critical/acknowledged/recovery notices should already be going to whoever needs them if you set the service checks up right.
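If you have a lot of definitions to eyeball, a small script can do the checking. This is only a rough sketch: the sample definitions (host gibson, services HTTP and PING) and the function name are made up for illustration, and it only looks at notification_options set directly in a define block, so values inherited from templates are not resolved.

```python
import re

# Minimum flags suggested above; 'n' (none) must not appear.
REQUIRED = {"host": set("dur"), "service": set("cur")}

# Hypothetical object definitions, for illustration only.
SAMPLE_CFG = """
define host {
    host_name               gibson
    notification_options    d,u,r
}
define service {
    host_name               gibson
    service_description     HTTP
    notification_options    w,c
}
define service {
    host_name               gibson
    service_description     PING
    notification_options    n
}
"""


def check_notification_options(cfg_text):
    """Return (object_type, options, problem) for each weak definition."""
    problems = []
    # Match each 'define <type> { ... }' block (no nested braces expected).
    for obj_type, body in re.findall(r"define\s+(\w+)\s*\{(.*?)\}", cfg_text, re.S):
        if obj_type not in REQUIRED:
            continue
        match = re.search(r"^\s*notification_options\s+(\S+)", body, re.M)
        if match is None:
            continue  # not set here; may be inherited from a template
        options = set(match.group(1).split(","))
        if "n" in options:
            problems.append((obj_type, match.group(1), "contains 'n' (none)"))
            continue
        missing = REQUIRED[obj_type] - options
        if missing:
            problems.append(
                (obj_type, match.group(1), "missing " + ",".join(sorted(missing)))
            )
    return problems


for problem in check_notification_options(SAMPLE_CFG):
    print(problem)
```

With the sample above, the host passes and both services get flagged: one for lacking recovery and unknown, the other for being set to none.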
If you need to send an additional bit of info about a specific host or service, you can click on a host or a service, and over on the right side in the Host Commands or Service Commands box (you know, the place where you acknowledge the alerts) there's a link for 'Send custom host/service notification'. That should send out a notice to the valid recipients. Pay attention to the command description: the Forced box sends the notice regardless of time restrictions, and the Broadcast box extends it to escalated contacts. The description for the host version reads:
This command is used to send a custom notification about the specified host. Useful in emergencies when you need to notify admins of an issue regarding a monitored system or service. Custom notifications normally follow the regular notification logic in Nagios. Selecting the Forced option will force the notification to be sent out, regardless of the time restrictions, whether or not notifications are enabled, etc. Selecting the Broadcast option causes the notification to be sent out to all normal (non-escalated) and escalated contacts. These options allow you to override the normal notification logic if you need to get an important message out.
I can't find the specific reference, but I'm pretty sure that feature showed up around Nagios 3.0.
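The same thing can be scripted through the Nagios external command file instead of the web UI. Here's a minimal sketch, with some assumptions: the command file path is a common default and may differ on your install (check command_file in nagios.cfg), the author name "oncall" is made up, and the helper names are mine. The options field is a bitmask where 1 is Broadcast and 2 is Forced, matching the two checkboxes described above.

```python
import time

# Bit flags for the <options> field of the custom notification commands.
BROADCAST = 1  # also notify escalated contacts
FORCED = 2     # ignore time restrictions and disabled notifications


def custom_notification(host, author, comment, service=None, options=0, now=None):
    """Format a SEND_CUSTOM_HOST/SVC_NOTIFICATION external command line."""
    timestamp = int(time.time() if now is None else now)
    if service is None:
        fields = ["SEND_CUSTOM_HOST_NOTIFICATION", host]
    else:
        fields = ["SEND_CUSTOM_SVC_NOTIFICATION", host, service]
    fields += [str(options), author, comment]
    return "[%d] %s\n" % (timestamp, ";".join(fields))


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write one command line to the Nagios command pipe.

    ASSUMPTION: the default path above; yours is whatever command_file
    is set to in nagios.cfg.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Build (but do not submit) a forced custom host notification.
line = custom_notification(
    "gibson", "oncall", "Gibson looks fixed, details in ticket#12345",
    options=FORCED,
)
print(line, end="")
```

Calling submit(line) on the nagios box would hand the command to the running daemon; the user running the script needs write access to the command file.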
Generally, my alerts follow this flow:
Nagios sends a critical alert to all contacts. (nagios mail #1)
I (or whoever gets paged first and is on call) log in to the nagios box, pick out the host or service and acknowledge it with a comment like "Ack -A Looks like the gibson is down", which gets mailed out automatically to everyone who got the critical alert. (nagios mail #2)
I open a ticket in my ticketing system to take notes (and track sweet billable time) and jot down the ticket number. (generates some of its own mail)
The gibson is fixed, nagios sees it and recovers, sending out a new notification to everyone who saw that I'd acknowledged it. (nagios mail #3)
Since this was the gibson that went down, I get back on nagios and send out a custom notification that just says "Gibson looks fixed, details in ticket#12345" and everyone who cares about it can hunt it down later. (nagios mail #4 - when needed)
It lets the monitoring system be a monitoring and alert system and lets your ticket system be your ticket system...
Give the Nagios documentation on how notifications work a quick read; the 'who gets notified' section matters. The fact that you are manually sending out what sounds effectively like an 'acknowledgment' makes me suspect that a key feature of nagios may have been missed somewhere in your organization's history. Either that or I'm not fully comprehending the problem...