Delayed Alerts

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Delayed Alerts

Post by highness »

We've recently run into an issue where alerts show up several minutes late (sometimes as much as 45-60 minutes).

When we restart Nagios (or do an apply) it appears to fix itself for a while, but the problem appears to recur approximately every 24 hours.

We're running Nagios XI 5.2.9 on an Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz with 32 cores.
bwallace
Posts: 1146
Joined: Tue Nov 17, 2015 1:57 pm

Re: Delayed Alerts

Post by bwallace »

Let's check the time, according to various components, on your XI machine. Run this command and post the output:
date; ntpdate; grep "date.timezone" /etc/php.ini; ls -l /etc/localtime; php -r 'echo date("D M j G:i:s T Y")."\n";'; mysql -uroot -pnagiosxi -e "SELECT NOW();"

Also, is your Nagios system set up to use sendmail or SMTP?

If you're using SMTP, Nagios XI should be sending emails straight out, so no 'spooling' occurs on the Nagios server. The only way the delay could originate on the Nagios server is if it were bogged down with other tasks (which is possible if you have thousands of service checks). Apart from that, when SMTP is in use the cause is usually rate limiting or the like upstream of Nagios. But since you say things are fine for a while after a restart, I suspect Nagios may be bogged down. How many total hosts and services are configured on your Nagios machine?
Admin > System Config > System Profile > Show Profile

- Please post screenshots of Admin > System Status and Monitoring Engine Status
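One rough way to gauge whether the monitoring engine is bogged down is to look at the `check_latency` values Nagios Core records in its status file. The following is a minimal sketch, not an official Nagios XI tool; the function name `latency_summary` is made up for illustration, and the usual status file location (`/usr/local/nagios/var/status.dat`) is an assumption that may differ on your system.

```shell
# Summarize check_latency values (in seconds) found in a Nagios status file.
# latency_summary is a hypothetical helper name, not part of Nagios.
latency_summary() {
    awk -F= '
        /^[[:space:]]*check_latency=/ {
            n++
            sum += $2
            if ($2 + 0 > max) max = $2 + 0
        }
        END {
            if (n == 0) { print "no check_latency entries found"; exit 1 }
            printf "checks=%d avg=%.3fs max=%.3fs\n", n, sum / n, max
        }
    ' "$1"
}
```

It could be called as `latency_summary /usr/local/nagios/var/status.dat`. Average or maximum latencies in the tens of seconds or more would be consistent with an overloaded scheduler, though they would not prove it.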


Sounds like you've already done this, but first confirm at what time Nagios XI sent these emails by running a notification report for the time period in question. In the UI go to Reports > Available Reports > Notifications. You'll have to compare the timestamps seen here to what was logged on your mail server. Are there any large gaps between the two?

Instead of running a report, you can also view the same info here....
Home > Incidents > Notifications

....or you can refer to "nagios.log" - here you will see checks, notifications, external commands, and events:
/usr/local/nagios/var/nagios.log
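Entries in nagios.log begin with a Unix epoch timestamp in square brackets, which makes them awkward to compare against times shown in the UI. Below is a small sketch that rewrites that prefix as a readable local time. It assumes GNU `date` (for `date -d @EPOCH`) and the usual `[epoch] message` line layout; `humanize_nagios_log` is a hypothetical helper name.

```shell
# Read nagios.log lines on stdin and convert the leading [epoch] to local time.
# Lines that do not start with "[digits] " are passed through unchanged.
humanize_nagios_log() {
    while IFS= read -r line; do
        case "$line" in
            \[[0-9]*\]\ *)
                epoch=${line#\[}        # drop the leading "["
                epoch=${epoch%%\]*}     # keep only what precedes the first "]"
                rest=${line#*\] }       # message text after "] "
                printf '%s %s\n' "$(date -d "@$epoch" '+%Y-%m-%d %H:%M:%S')" "$rest"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

For example, `grep 'SERVICE ALERT' /usr/local/nagios/var/nagios.log | humanize_nagios_log` would list service alerts with readable timestamps that can be lined up against what the status screens showed.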

- thanks -
Be sure to check out the Knowledgebase for helpful articles and solutions!
highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Re: Delayed Alerts

Post by highness »

bwallace wrote:Let's check the time, according to various components, on your XI machine. Run this command and post the output:
date; ntpdate; grep "date.timezone" /etc/php.ini; ls -l /etc/localtime; php -r 'echo date("D M j G:i:s T Y")."\n";'; mysql -uroot -pnagiosxi -e "SELECT NOW();"

Code:

Wed Sep 28 10:44:18 PDT 2016
28 Sep 10:44:18 ntpdate[10330]: no servers can be used, exiting
; http://www.php.net/manual/en/datetime.configuration.php#ini.date.timezone
date.timezone = America/Los_Angeles
-rw-r--r--. 1 root root 2819 May 13  2014 /etc/localtime
Wed Sep 28 10:44:18 PDT 2016
+---------------------+
| now()               |
+---------------------+
| 2016-09-28 10:44:18 |
+---------------------+
bwallace wrote: Also, is your Nagios system set up to use sendmail or SMTP?
It is, but we're not using SMTP or sendmail for alerts - we're using the Service Status screen (our NOC watches these screens all day long)
bwallace wrote: But since you say things are fine for a while after a restart, I suspect Nagios may be bogged down. How many total hosts and services are configured on your Nagios machine?
Admin > System Config > System Profile > Show Profile
Total Hosts: 1750
Total Services: 10576
bwallace wrote: Please post screenshots of Admin > System Status and Monitoring Engine Status
Screen Shot 2016-09-28 at 10.49.52 AM.png
bwallace wrote: Instead of running a report, you can also view the same info here....
Home > Incidents > Notifications
Looking in the Home/Incidents/Notifications for the alerts that we were seeing, there is a gap where Nagios says there were no alerts, but from the screen shots I received, there were alerts. I can't redact the sensitive information, so I'll have to send you that in a PM.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Delayed Alerts

Post by avandemore »

Does

Code:

date
output the correct time? It's not clear whether the info in your previous post was generated close to the time of posting.

It is possible your Nagios system isn't keeping time correctly. Generally, configuring NTP will keep a system's time accurate enough for most uses, but some edge cases can make it fail. Typical failure points include cheap or malfunctioning hardware, a hypervisor and guest not playing well together, or even certain kernel configurations, e.g. tickless kernels.

You can check if NTP is running by using:

Code:

service ntpd status
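A running ntpd does not by itself show that the clock is actually synchronized. As a hedged sketch, the offset of the peer that ntpd has selected can be pulled out of `ntpq -pn` output, where the selected peer is marked with a leading `*` and the offset (in milliseconds) is the second-to-last column. The helper name `selected_peer_offset` is made up for illustration.

```shell
# Read `ntpq -pn` output on stdin and print the offset (ms) of the selected peer.
# Returns non-zero if no peer is marked with "*", i.e. ntpd is not synced.
selected_peer_offset() {
    awk '/^\*/ { print $(NF-1); found = 1 } END { exit !found }'
}
```

Typical use would be `ntpq -pn | selected_peer_offset`. No output and a non-zero exit status would suggest that ntpd has not selected any time source.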
Previous Nagios employee
highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Re: Delayed Alerts

Post by highness »

avandemore wrote:Does

Code:

date
output correct time?
The date is the same for all the commands issued above - it was a cutting/pasting error on my part.

avandemore wrote:It is possible your Nagios system isn't keeping time correctly. Generally, configuring NTP will keep a system's time accurate enough for most uses, but some edge cases can make it fail. Typical failure points include cheap or malfunctioning hardware, a hypervisor and guest not playing well together, or even certain kernel configurations, e.g. tickless kernels.

You can check if NTP is running by using:

Code:

service ntpd status
It's running on a ProLiant DL380p Gen8, and ntpd is running (and has been).

I also checked the time on our external MySQL server, and its time is in sync with the Nagios server as well.
bwallace
Posts: 1146
Joined: Tue Nov 17, 2015 1:57 pm

Re: Delayed Alerts

Post by bwallace »

About those screenshots you PM'ed me - thanks and I'm replying here so everyone else can remain in the loop.

Now an alert does not necessarily mean a notification will be sent. Just need to clarify that because the two hosts seen in screenshot #1 for the alerts are not seen in screenshot #2 for notifications.
If notifications are not configured for those 2 hosts then this is expected.
Could you double check to make sure notifications are enabled for these hosts?
If they are, then we need to see this in the notifications page - take note of the time stamp you see there - that will be key.

As a side note, you should seriously consider implementing a RAM disk, given how many service checks you have running:
https://assets.nagios.com/downloads/nag ... giosXI.pdf
Utilizing a RAM Disk is one of the first steps in improving performance.

- thanks -
highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Re: Delayed Alerts

Post by highness »

bwallace wrote:Now an alert does not necessarily mean a notification will be sent. Just need to clarify that because the two hosts seen in screenshot #1 for the alerts are not seen in screenshot #2 for notifications.

If notifications are not configured for those 2 hosts then this is expected.

Could you double check to make sure notifications are enabled for these hosts?
If they are, then we need to see this in the notifications page - take note of the time stamp you see there - that will be key.
Notifications are enabled for those hosts. To be clear, it's not just those hosts that have issues: it happens to all hosts, and those were just the ones we captured that day. This morning, I have 3 different hosts (I sent you a PM with the screen capture).

The delay is pretty apparent in that PM that I sent you.

But for those folks who don't have the visual aids: we had several alerts show up at 21:30 saying that several checks (on several different hosts) had failed 30 minutes earlier, but they were only then appearing on the Status Summary page. Within 2-3 minutes, the checks all disappeared. When we drilled down into the service checks that had been alerting just 2-3 minutes prior, they showed an OK status for the past 32-33 minutes.
bwallace wrote: As a side note, you should seriously consider implementing a RAM disk, given how many service checks you have running:
https://assets.nagios.com/downloads/nag ... giosXI.pdf
Utilizing a RAM Disk is one of the first steps in improving performance.
I will certainly look into implementing a RAM disk ASAP.
bwallace
Posts: 1146
Joined: Tue Nov 17, 2015 1:57 pm

Re: Delayed Alerts

Post by bwallace »

I understand now - thanks for clarifying. Can you PM me a profile? This will enable us to take a deeper look at the config, logs, etc which really seems necessary at this point.

EDIT: Profile received.
highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Re: Delayed Alerts

Post by highness »

bwallace wrote:I understand now - thanks for clarifying. Can you PM me a profile? This will enable us to take a deeper look at the config, logs, etc which really seems necessary at this point.
Sent.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Delayed Alerts

Post by rkennedy »

Looks like your DB is offloaded, and I believe this issue could be related to timestamps. Could you run the following commands on both the XI machine and the offloaded DB machine? Replace the SQL credentials as needed for the SQL machine. Some of these may fail, and that's fine.

Code:

grep "date.timezone" /etc/php.ini
ls -l /etc/localtime
php -r 'echo date("D M j G:i:s T Y")."\n";'
date
mysql -uroot -pn@gweb -e "SELECT NOW();"
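Since eyeballing two formatted timestamps is error-prone, a numeric comparison can help. The sketch below takes two epoch values and prints their absolute difference in seconds; `clock_skew` is a hypothetical helper, and the host, user, and password in the usage comment are placeholders rather than values from this thread.

```shell
# Print the absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
    local_epoch=$1
    db_epoch=$2
    diff=$((local_epoch - db_epoch))
    if [ "$diff" -lt 0 ]; then
        diff=$((-diff))
    fi
    echo "$diff"
}

# Usage (DB_HOST, USER and PASS are placeholders):
#   clock_skew "$(date +%s)" \
#     "$(mysql -h DB_HOST -uUSER -pPASS -N -s -e 'SELECT UNIX_TIMESTAMP();')"
```

Epoch values are independent of time zone, so this only detects clock drift between the two machines. A time zone mismatch between PHP, the OS, and MySQL would still need the formatted-date checks above.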
Former Nagios Employee