URGENT! Nagios flooding mail server but mailq is empty!

dfmco · Post by **dfmco** » Fri Mar 31, 2017 4:07 pm

We had an outage that generated alerts on all devices and services on our Nagios server. The problem is that more than 2 hours after the last recovery in the Nagios console, we are still getting down notifications from Nagios. I figured a queue was backed up and I would just need to clear it, but I find no waiting mail on the Nagios server (mailq) and no mail backed up on the Exchange server (all mail queues empty). Where else should I look?

I did manage to verify it is the Nagios server. I shut down the server and no more emails come out. As soon as I start the server, notifications resume. Where else should I be looking for outbound notifications besides mailq?

I had the server shut down because the number of messages was overheating the techs' phones, but I had to re-enable it due to contractual requirements.

dfmco · Post by **dfmco** » Fri Mar 31, 2017 4:14 pm

I am using SMTP but I can confirm that if I stop postfix, mail keeps flowing. I did a search on smtp mail queues and phpmailer but nothing of value is returned.

dfmco · Post by **dfmco** » Fri Mar 31, 2017 4:30 pm

Here is the offending process but I have no idea where to go to clean up what is queued!
```
[root@bmcap-nagios01 mail]# lsof -nPi tcp:25
COMMAND  PID   USER   FD  TYPE  DEVICE  SIZE/OFF  NODE  NAME
php      7655  nagios 6u  IPv4  208934  0t0       TCP   10.4.199.12:59706->216.230.228.195:25 (ESTABLISHED)
```
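A minimal Python sketch of pulling the owning process out of `lsof` output like the above, so that each process holding an outbound SMTP connection can be inspected further. The function name `smtp_connections` and the dictionary keys are hypothetical, and the column layout is assumed to match the output pasted in this post.

```python
SAMPLE = """COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
php 7655 nagios 6u IPv4 208934 0t0 TCP 10.4.199.12:59706->216.230.228.195:25 (ESTABLISHED)
"""


def smtp_connections(lsof_output, port=25):
    """Return one dict per connection whose remote end is the given port."""
    found = []
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header row and anything too short to carry a NAME column.
        if len(fields) < 9 or fields[0] == "COMMAND":
            continue
        name = fields[8]
        if "->" not in name:
            continue  # listening sockets have no remote endpoint
        local, remote = name.split("->", 1)
        if remote.rsplit(":", 1)[-1] != str(port):
            continue
        found.append({
            "command": fields[0],
            "pid": int(fields[1]),
            "user": fields[2],
            "local": local,
            "remote": remote,
        })
    return found


if __name__ == "__main__":
    for conn in smtp_connections(SAMPLE):
        print(conn["command"], conn["pid"], conn["user"], conn["remote"])
```

The PID it reports (7655 in this output) can then be looked up with `ps -fp 7655` to see the full command line and the parent process that launched it.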

dfmco · Post by **dfmco** » Fri Mar 31, 2017 4:40 pm

I cannot find any documentation on phpmailer. Is this a bug in the software? I am getting an alert every minute. My concern is that the queue is a phantom, and given the huge delays in these messages, I wonder how long messages are delayed for my production outages. My outage ended over 4 hours ago but I am still getting down alerts, and the Nagios console has shown no alerts for over 4 hours! What gives?

dfmco · Post by **dfmco** » Fri Mar 31, 2017 4:42 pm

64 bit OVA from Nagios on ESXi

dfmco · Post by **dfmco** » Sun Apr 02, 2017 8:07 am

When using SMTP as the mail method, I see php mailing directly (not using postfix). Where is this mail queued? I have had many instances where we will have a large outage and we will get mail for several hours afterward with no way to stop the flood of email.

When the problem above happens, there is usually a delay of 1 minute between emails. Is this a php issue? Where are the timers located so this can be tightened up?

I have a number of issues with flapping on T1 interfaces when monitoring routers. I read the flapping documentation but I am not clear on how to modify the timers to fit my needs. I would like flapping detection to kick in when an interface fails up to 5 minute intervals (or longer). How would I accomplish this in a supportable fashion?

For the flapping issue above, can this be done for SNMP traps as well, since the traps are generated by the monitored device? We have had many instances of a T1 flapping several times a minute, which caused techs' phones to overheat due to the sheer number of messages.

While investigating best practices, is there an easy way to spread checks over a longer period of time? Right now, every device is checked every 2 minutes. I don't see a way to set an interval in seconds in the GUI, and setting some checks to 3, 4 or 5 minutes will not meet our SLA with the client. How do you suggest we spread checks over time so that they don't all queue up on the same interval?
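For context on the check-spreading question, the Nagios Core scheduling documentation describes a "smart" inter-check delay that staggers initial checks: the average check interval divided by the total number of services. The Python sketch below only illustrates that arithmetic. The function name, the 600-service count, and the 60-second `interval_length` are assumptions for illustration, not values from this system.

```python
def inter_check_delay_seconds(check_intervals, interval_length=60):
    """Delay between consecutive scheduled checks, in seconds.

    check_intervals: one check interval per service, in time units.
    interval_length: seconds per time unit (assumed 60 here).
    """
    total = len(check_intervals)
    if total == 0:
        return 0.0
    average_units = sum(check_intervals) / total
    return average_units * interval_length / total


if __name__ == "__main__":
    # Hypothetical: 600 services, all on the 2-minute interval from the post.
    delay = inter_check_delay_seconds([2] * 600)
    print(f"one check starts roughly every {delay:.2f} s")
```

Under those assumptions the checks are already staggered across the 2-minute window rather than fired together, so a shorter per-check interval is not needed just to spread the load.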

dwhitfield · Post by **dwhitfield** » Mon Apr 03, 2017 8:12 am

I see you have a ticket about this issue, so I am going to lock the thread.

EDIT: Unlocking due to request from customer.

dflick · Post by **dflick** » Mon Apr 03, 2017 9:28 am

To recap:

Nagios lost connectivity to the network which caused all services and hosts to alert.

Alerts are still being sent from Nagios over 2 hours after recovery (Nagios showed all clear at 12:33 PM but alerts were still coming at 2:00 PM).

Verified that the Nagios server was the source by shutting off the server, which killed the alarms. Re-tested by bringing up the server and shutting off Postfix (mailq was empty). With Postfix shut off, the server was still sending alarms. Used lsof to verify that php was sending the alerts to my mail server. Finally used a firewall rule to block port 25 from Nagios to my mail server to stop the alerts.

I still need to clear the backlog of alerts to prevent techs' phones from overheating due to the excessive alerts. We were still getting "down" alerts, so the "up" alerts have not even started yet.

avandemore · Post by **avandemore** » Mon Apr 03, 2017 11:01 am

> Where is this mail queued?

On the MTA you configured in SMTP settings.

> Is this a php issue?

We'd need more information to speculate or resolve this.

> I would like flapping detection to kick in when an interface fails up to 5 minute intervals (or longer). How would I accomplish this in a supportable fashion?

This really should be a separate thread as it's a completely different topic. I will say I don't understand the question though, as flap detection is for when a service toggles between a good and bad state during a short period of time.

> For the flapping issue above, can this be done for SNMP traps as well since the traps are generated by the monitored device?

Yes, services tied to SNMP traps are subject to flapping logic.
You can increase the granularity for the timeperiod in nagios.cfg, but that is not recommended.
https://assets.nagios.com/downloads/nag ... gmain.html

dfmco · Post by **dfmco** » Mon Apr 03, 2017 12:18 pm

Could you clarify your responses?

I have verified with lsof that php is the process that is sending the mail. It is not being cached on the MTA on the Nagios server where I expect to see the mail (mailq). Where on the Nagios server is the mail held that is sent by php? I need the specific queue or spool file so I can clean up the old mail.

I noticed that the individual mails come out about once a minute. Why would I have such a long delay between mails?

For the flapping, it is typical for a T1 interface to flap once every 2 minutes as it tries to restart itself. I would like Nagios to detect this as a flap and suppress alerts. Currently I think flap detection works on a much smaller interval, as I will get alerts every 2 minutes until the circuit is fixed. The documentation for flap detection was hard for me to follow and I was hoping that a practical example may help. I would like flap detection to treat failures within a 5 minute interval as the same outage and not alert. Is that possible? The reason I mention traps is that a serial interface is tied to a single service, but a trap could send an alert for multiple services (serial interface down AND Gig interface down AND memory utilization, etc.), so I was not sure how to configure flap detection to sense a single service as a trap (serial interface fails and recovers several times, but ignore the gig interface and memory traps as related to flap detection).

Does that make sense?
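As a practical illustration of the flap detection arithmetic asked about above: the Nagios Core documentation describes flap detection as working on the last 21 check results (20 possible state changes) rather than on a wall-clock window, with newer changes weighted more heavily than older ones. The Python sketch below follows that description. The linear 0.75 to 1.25 weighting and the 5% / 20% thresholds are assumptions used for illustration, and the function names are hypothetical; check the thresholds in your own configuration.

```python
def percent_state_change(states, low_weight=0.75, high_weight=1.25):
    """Weighted percent state change for a check history (oldest first)."""
    slots = len(states) - 1  # number of possible transitions
    if slots < 1:
        return 0.0
    weighted = 0.0
    for i in range(1, len(states)):
        if states[i] != states[i - 1]:
            # Assumed linear ramp: oldest slot -> low_weight, newest -> high_weight.
            position = (i - 1) / (slots - 1) if slots > 1 else 1.0
            weighted += low_weight + position * (high_weight - low_weight)
    return 100.0 * weighted / slots


def update_flapping(pct, was_flapping, low=5.0, high=20.0):
    """Start flapping above the high threshold, stop below the low one."""
    if not was_flapping and pct > high:
        return True
    if was_flapping and pct < low:
        return False
    return was_flapping


if __name__ == "__main__":
    # Hypothetical T1 history: one failed check out of every five, 21 results.
    history = (["OK"] * 4 + ["CRITICAL"]) * 4 + ["OK"]
    pct = percent_state_change(history)
    print(f"{pct:.1f}% state change, flapping: {update_flapping(pct, False)}")
```

With checks every 2 minutes, 21 results cover roughly 40 minutes, so under these assumptions an interface that fails once every few checks would already exceed the example high threshold; the tuning knobs are the thresholds and the check interval rather than a 5-minute timer.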

Nagios Support Forum
