Delayed Alerts
We've recently run into an issue where alerts show up several minutes late (sometimes as much as 45-60 minutes).
When we restart Nagios (or do an apply), the problem appears to fix itself for a while, but it appears to return after approximately 24 hours.
We're running Nagios XI 5.2.9 on an Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz with 32 cores.
Re: Delayed Alerts
Let's check the time, according to various components, on your XI machine. Run this command and post the output:
date; ntpdate; grep "date.timezone" /etc/php.ini; ls -l /etc/localtime; php -r 'echo date("D M j G:i:s T Y")."\n";'; mysql -uroot -pnagiosxi -e "SELECT NOW();"
Also, is your Nagios system set up to use sendmail or SMTP?
If you're using SMTP then Nagios XI should be sending emails straight out, so no 'spooling' occurs on the Nagios server. The only way the cause of the delay could be at the Nagios server is if it was bogged down with other tasks (which is possible if you have thousands of service checks). Apart from that, when SMTP is in use we usually see that this has to do with rate limiting or the like upstream of Nagios. But since you say things are fine for a while after a restart, I suspect Nagios may be bogged down - how many total hosts and services are configured on your Nagios machine?
Admin > System Config > System Profile > Show Profile
- Please post screenshots of Admin > System Status and Monitoring Engine Status
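One quick way to see whether the monitoring engine is falling behind is to look at check latency. The sketch below is an illustration rather than an official procedure. It assumes `nagiostats` lives at /usr/local/nagios/bin/nagiostats (the usual source-install path) and that its MRTG mode reports AVGACTSVCLAT as a whole number of milliseconds. The helper name `latency_exceeds` and the 30 second limit are made up for this example.

```shell
# Hypothetical helper: succeed if a check latency given in milliseconds
# is above a limit given in seconds.
latency_exceeds() {
    latency_ms=$1
    limit_s=$2
    [ "$latency_ms" -gt $(( limit_s * 1000 )) ]
}

# Possible usage on the XI server (assumed path, assumed integer output):
#   avg_ms=$(/usr/local/nagios/bin/nagiostats -m -d AVGACTSVCLAT)
#   if latency_exceeds "$avg_ms" 30; then
#       echo "average active service check latency is above 30s: ${avg_ms}ms"
#   fi
```

If the average latency climbs steadily between restarts, that would point at the engine being overloaded rather than at a clock problem.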
It sounds like you've already done this, but first confirm at what time Nagios XI sent these emails by running a notification report for the time period in question. In the UI go to Reports > Available Reports > Notifications. You'll have to compare the timestamps seen here to what was logged on your mail server. Any large gaps between the two?
Instead of running a report, you can also view the same info here....
Home > Incidents > Notifications
....or you can refer to "nagios.log" - here you will see checks, notifications, external commands, and events:
/usr/local/nagios/var/nagios.log
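Entries in nagios.log start with a Unix epoch timestamp in square brackets, which makes them awkward to compare with the times shown in the UI. A minimal sketch for making them readable follows. It assumes GNU `date` (for the `-d @epoch` form), and the function name `nagios_log_readable` is invented for this example.

```shell
# Rewrite the leading "[epoch]" stamp of each nagios.log line as a local date and time.
# Reads log lines on stdin; lines without a leading "[digits] " are passed through unchanged.
nagios_log_readable() {
    while IFS= read -r line; do
        case $line in
            \[[0-9]*\]\ *)
                epoch=${line#\[}          # drop the opening bracket
                epoch=${epoch%%\]*}       # keep only what precedes the first "]"
                rest=${line#*\] }         # everything after "] "
                # Assumes GNU date, which accepts "@<seconds since epoch>"
                printf '[%s] %s\n' "$(date -d "@$epoch" '+%Y-%m-%d %H:%M:%S')" "$rest"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

For example, `grep NOTIFICATION /usr/local/nagios/var/nagios.log | nagios_log_readable` would list the notification entries with readable times in the server's local timezone.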
- thanks -
Be sure to check out the Knowledgebase for helpful articles and solutions!
Re: Delayed Alerts
bwallace wrote: Let's check the time, according to various components, on your XI machine. Run this command and post the output:
date; ntpdate; grep "date.timezone" /etc/php.ini; ls -l /etc/localtime; php -r 'echo date("D M j G:i:s T Y")."\n";'; mysql -uroot -pnagiosxi -e "SELECT NOW();"
Also, is your Nagios system set up to use sendmail or SMTP?
Code: Select all
Wed Sep 28 10:44:18 PDT 2016
28 Sep 10:44:18 ntpdate[10330]: no servers can be used, exiting
; http://www.php.net/manual/en/datetime.configuration.php#ini.date.timezone
date.timezone = America/Los_Angeles
-rw-r--r--. 1 root root 2819 May 13 2014 /etc/localtime
Wed Sep 28 10:44:18 PDT 2016
+---------------------+
| now()               |
+---------------------+
| 2016-09-28 10:44:18 |
+---------------------+
It is, but we're not using SMTP or sendmail for alerts - we're using the Service Status screen (our NOC watches these screens all day long).
bwallace wrote: But since you say things are fine for a while after a restart, I suspect Nagios may be bogged down - how many total hosts and services are configured on your Nagios machine?
Admin > System Config > System Profile > Show Profile
Total Hosts: 1750
Total Services: 10576
bwallace wrote: Please post screenshots of Admin > System Status and Monitoring Engine Status
bwallace wrote: Instead of running a report, you can also view the same info here....
Home > Incidents > Notifications
Looking in Home > Incidents > Notifications for the alerts that we were seeing, there is a gap where Nagios says there were no alerts, but the screenshots I received show that there were alerts. I can't redact the sensitive information, so I'll have to send you that in a PM.
Re: Delayed Alerts
Does the following command output the correct time?
Code: Select all
date
It's not clear if the info in your previous post was generated close to the same time as the posting.
It is possible your Nagios system isn't keeping time correctly. Generally, configuring NTP will keep a system's time accurate enough for most uses, but some edge cases can make it fail. Typical failure points include cheap or malfunctioning hardware, a hypervisor and guest not playing well together, or even certain kernel configurations, e.g. tickless kernels.
You can check if NTP is running by using:
Code: Select all
service ntpd status
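A running ntpd is not necessarily a synchronized one. As a rough sketch, the peer list from `ntpq -pn` can be checked for a selected peer (the line marked with `*`) and its offset column, which is reported in milliseconds. This assumes `ntpq` from the ntp package is installed, and the function name `ntp_selected_offset` is made up for illustration.

```shell
# Print the offset (in ms) of the peer ntpd has selected, reading `ntpq -pn` output on stdin.
# Returns non-zero if no line is marked with '*', i.e. no peer is currently selected.
ntp_selected_offset() {
    awk '/^\*/ { print $9; found = 1 } END { exit (found ? 0 : 1) }'
}

# Possible usage:
#   ntpq -pn | ntp_selected_offset || echo "ntpd has no selected peer"
```

An offset of a few milliseconds is normal. A missing selected peer, or an offset measured in seconds, would suggest the clock is not being disciplined.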
Previous Nagios employee
Re: Delayed Alerts
avandemore wrote: Does the following command output the correct time?
Code: Select all
date
The date is the same for all the commands issued above - it was a cutting/pasting error on my part.
avandemore wrote: It is possible your Nagios system isn't keeping time correctly. Generally, configuring NTP will keep a system's time accurate enough for most uses, but some edge cases can make it fail. Typical failure points include cheap or malfunctioning hardware, a hypervisor and guest not playing well together, or even certain kernel configurations, e.g. tickless kernels.
You can check if NTP is running by using:
Code: Select all
service ntpd status
It's running on a ProLiant DL380p Gen8 and ntpd is running (and has been).
I also checked the time on our external MySQL server and its time is in sync with the Nagios server as well.
Re: Delayed Alerts
About those screenshots you PM'ed me - thanks and I'm replying here so everyone else can remain in the loop.
Now an alert does not necessarily mean a notification will be sent. Just need to clarify that because the two hosts seen in screenshot #1 for the alerts are not seen in screenshot #2 for notifications.
If notifications are not configured for those 2 hosts then this is expected.
Could you double check to make sure notifications are enabled for these hosts?
If they are, then we need to see this in the notifications page - take note of the time stamp you see there - that will be key.
As a side note, you should seriously consider implementing a RAM disk, given how many service checks you have running:
https://assets.nagios.com/downloads/nag ... giosXI.pdf
Utilizing a RAM Disk is one of the first steps in improving performance.
- thanks -
Re: Delayed Alerts
bwallace wrote: Now an alert does not necessarily mean a notification will be sent. Just need to clarify that because the two hosts seen in screenshot #1 for the alerts are not seen in screenshot #2 for notifications.
If notifications are not configured for those 2 hosts then this is expected.
Could you double check to make sure notifications are enabled for these hosts?
If they are, then we need to see this in the notifications page - take note of the time stamp you see there - that will be key.
Notifications are enabled for those hosts; and to be clear, it's not just those hosts that have issues - it happens to all hosts, those were just the ones that we captured that day. This morning, I have 3 different hosts (I sent you a PM with the screen capture).
The delay is pretty apparent in that PM that I sent you.
But for those folks who don't have the visual aids: we had several alerts show up at 21:30 saying that several checks (on several different hosts) had failed 30 minutes earlier, but they were only then appearing on the Status Summary page. Within 2-3 minutes, the alerts all disappeared. When we drilled down into the service checks that had been alerting just 2-3 minutes prior, they showed an OK status for the past 32-33 minutes.
bwallace wrote: As a side note, you should seriously consider implementing a RAM disk, given how many service checks you have running:
https://assets.nagios.com/downloads/nag ... giosXI.pdf
Utilizing a RAM Disk is one of the first steps in improving performance.
I will certainly look into implementing a RAM disk ASAP.
Re: Delayed Alerts
I understand now - thanks for clarifying. Can you PM me a profile? This will enable us to take a deeper look at the config, logs, etc which really seems necessary at this point.
EDIT: Profile received.
Re: Delayed Alerts
bwallace wrote: I understand now - thanks for clarifying. Can you PM me a profile? This will enable us to take a deeper look at the config, logs, etc which really seems necessary at this point.
Sent.
Re: Delayed Alerts
Looks like your DB is offloaded, and I believe this issue could be related to timestamps. Could you run the following commands on both the XI machine and the offloaded DB machine? Replace the SQL credentials as needed for the SQL machine. Some of these may fail, and that's fine.
Code: Select all
grep "date.timezone" /etc/php.ini
ls -l /etc/localtime
php -r 'echo date("D M j G:i:s T Y")."\n";'
date
mysql -uroot -pn@gweb -e "SELECT NOW();"
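Comparing the printed times by eye can hide a small drift or a timezone mismatch. As a hedged sketch, the two clocks can be compared as epoch seconds instead, since `date +%s` and MySQL's `UNIX_TIMESTAMP()` are both independent of the configured timezone. The function name `clock_skew` and the DB_HOST, DB_USER and DB_PASS placeholders are hypothetical.

```shell
# Absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
    diff=$(( $1 - $2 ))
    [ "$diff" -lt 0 ] && diff=$(( -diff ))
    printf '%s\n' "$diff"
}

# Possible usage from the XI machine (placeholders are hypothetical, substitute your own):
#   local_now=$(date +%s)
#   db_now=$(mysql -h DB_HOST -uDB_USER -pDB_PASS -N -B -e "SELECT UNIX_TIMESTAMP();")
#   echo "skew between XI and DB: $(clock_skew "$local_now" "$db_now")s"
```

A skew of a second or two is just the time taken by the query. If the epoch values agree but `SELECT NOW();` and `date` print different wall-clock times, the timezone settings differ rather than the clocks.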
Former Nagios Employee