Hi Support,
We had a situation on 20th Sept 2019 where alerts for 4 DB file systems were generated, but with a delay of around 14 mins.
Case:
The FS had reached 100% at 14:32 HKT.
The polling interval is 5 mins with 6 lags, so the notification was expected in Nagios at 15:02 (5 * 6 = 30 mins after 14:32); however, the notifications were received in Nagios between 15:16 and 15:17 HKT.
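The expected alert time above is simple arithmetic on the template's poll and lag values. A minimal Python sketch of that calculation, assuming the simple "poll interval * lag" model used in this post (the constant and function names are illustrative, not part of Nagios XI):

```python
from datetime import datetime, timedelta

# Values from the post (20 Sep 2019, HKT). POLL_INTERVAL_MIN and LAGS are the
# template's poll and lag settings as described above.
FS_FULL = datetime(2019, 9, 20, 14, 32)
NOTIFIED = datetime(2019, 9, 20, 15, 16)
POLL_INTERVAL_MIN = 5
LAGS = 6


def expected_notification(problem_start: datetime, poll_min: int, lags: int) -> datetime:
    """Expected alert time under the simple poll * lag model used in this post."""
    return problem_start + timedelta(minutes=poll_min * lags)


def delay_minutes(expected: datetime, actual: datetime) -> float:
    """How many minutes the actual notification trailed the expected one."""
    return (actual - expected).total_seconds() / 60


if __name__ == "__main__":
    expected = expected_notification(FS_FULL, POLL_INTERVAL_MIN, LAGS)
    print("expected:", expected.strftime("%H:%M"))            # expected: 15:02
    print("delay (min):", delay_minutes(expected, NOTIFIED))  # delay (min): 14.0
```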
Details:
1) FS utilization on the server side at 14:32 HKT:
=== (190920) Fri Sep 20 14:32:01 HKT 2019 LIST /usr/semasupp/ctl/CHKDF.TXT ===
/db2/temp2/myebbs/ebbsdbtemp2 14990 MB, used 14974 MB, free 16 MB 100
/db2/temp1/myebbs/ebbsdbtemp1 14990 MB, used 14974 MB, free 16 MB 100
/db2/temp3/myebbs/ebbsdbtemp3 14990 MB, used 14974 MB, free 16 MB 100
/db2/temp4/myebbs/ebbsdbtemp4 14990 MB, used 14974 MB, free 16 MB 100
2) Snapshot of the notification in Nagios attached.
3) We also observed some VCS ERROR and cleaning activity around 15:07 HKT, after which the notification arrived in Nagios. Snapshot attached.
Please let me know if you need more details. We need to understand the actual reason the alert was delayed, as an RCA must be provided to the client for this case. Thanks.
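The CHKDF.TXT output in item 1 reports total, used and free space plus a rounded percentage. A minimal Python sketch for parsing those lines and checking them against a threshold; the line layout is inferred only from the four sample lines above, and `THRESHOLD_PCT` is a hypothetical value, not the threshold actually configured in Nagios:

```python
import re
from typing import NamedTuple, Optional

# Assumed layout of one CHKDF.TXT line, inferred from the samples above:
# <mount> <total> MB, used <used> MB, free <free> MB <pct>
LINE_RE = re.compile(
    r"^(?P<mount>\S+)\s+(?P<total>\d+) MB, used (?P<used>\d+) MB, "
    r"free (?P<free>\d+) MB\s+(?P<pct>\d+)$"
)

THRESHOLD_PCT = 90  # hypothetical; substitute the FS threshold set in Nagios


class FsUsage(NamedTuple):
    mount: str
    total_mb: int
    used_mb: int
    free_mb: int
    reported_pct: int  # the rounded percentage printed at the end of the line

    @property
    def computed_pct(self) -> float:
        """Exact used percentage recomputed from the MB figures."""
        return 100 * self.used_mb / self.total_mb


def parse_line(line: str) -> Optional[FsUsage]:
    """Parse one CHKDF line; return None if it does not match the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return FsUsage(m["mount"], int(m["total"]), int(m["used"]), int(m["free"]), int(m["pct"]))


if __name__ == "__main__":
    sample = "/db2/temp2/myebbs/ebbsdbtemp2 14990 MB, used 14974 MB, free 16 MB 100"
    fs = parse_line(sample)
    if fs is not None:
        print(fs.mount, round(fs.computed_pct, 2), fs.computed_pct >= THRESHOLD_PCT)
```

The recomputed value (about 99.89%) shows that the "100" column is a rounded figure, which matters if a threshold is set very close to 100%.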
Delayed FS notification in Nagios XI
Re: Delayed FS notification in Nagios XI
Yes, it would probably be good to get some additional information. Could you PM your system profile to me? Just go to Admin -> System Profile -> Download Profile.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Delayed FS notification in Nagios XI
I'm looking at the profile you sent in, and it's missing the log files. That may be from the manual process we needed to use to download the profile. In any case, the only thing that seems like it would cause a delay in sending an email is that the system load is really high: the profile shows a load of 14+. I'm not sure that's enough to cause a 15-minute delay in sending an email, though.
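Whether a load of 14+ is severe depends on how many CPU cores the Nagios XI server has, which the thread does not state. A minimal Python sketch (an illustration, not a Nagios XI tool) that normalises the load average by core count; `load_per_core` is a hypothetical helper name:

```python
import os


def load_per_core(load: float, cores: int) -> float:
    """Load average divided by CPU count; sustained values above 1.0 suggest queued work.

    On Linux the load average also counts tasks waiting on uninterruptible I/O.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load / cores


if __name__ == "__main__":
    if hasattr(os, "getloadavg"):  # os.getloadavg() exists only on Unix-like systems
        one, five, fifteen = os.getloadavg()
        cores = os.cpu_count() or 1
        for label, value in (("1m", one), ("5m", five), ("15m", fifteen)):
            print(f"{label}: load {value:.2f}, per core {load_per_core(value, cores):.2f}")
```

For example, a load of 14 is 3.5 per core on a 4-core host but under 1.0 on a 16-core host.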
Do you know if this delay is common, or can it be reproduced? If we could trigger it, and then get a full system profile, we could track down what Nagios was doing at the time.
Re: Delayed FS notification in Nagios XI
Hi Support,
In addition to the Nagios profile, I have also attached a file containing screenshots of the template settings (poll and lag) used by the monitoring rule, the utilization from the server semasupp log, the FS threshold settings in Nagios, and the notification that occurred in Nagios.
This shows that on the server the FS was at 92% at 14:22 HKT on 20th Sept, yet the alert notification was received in Nagios at 15:16 HKT, a gap of 54 mins.
Looking at the template, the alert should be generated after 30 mins, based on the retry interval and max check attempts. Please advise what caused the delay in this scenario.
Also, we would like to know whether there is a fast-track response service we can get for this case, as the client is waiting for an RCA and we do not want to delay our response. Thanks.
Support note: profile.zip and screenshots.zip downloaded and shared with team.
Re: Delayed FS notification in Nagios XI
I cannot find any reference to HKLPDSEBB03 in the log files in the profile. The logs also start on Sept 30th, rather than Sept 20th.
The max check attempts and retry interval account for 30 minutes of the 52 minutes, so we're down to 22 minutes.
Then the check interval is 5 minutes, so that brings us down to 17 minutes.
The first notification delay is 5 minutes, so that brings us down to 12 minutes.
Depending on when the last check finished, there could be up to another minute of waiting for cron to send off the notification, so 11 minutes.
Then there could be delays due to check plugins timing out, or delays in the email server. But we are down to 11 minutes, and that is the best I can get with the information at hand.
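The accounting above is a delay budget: start from the total gap and subtract each known component. A minimal Python sketch of that subtraction, using only the component values named in the reply above; the component list is the reply's, not an exhaustive model of the Nagios XI notification path:

```python
from datetime import datetime
from typing import Mapping

# Timestamps quoted earlier in the thread (HKT, 20 Sep 2019).
FS_AT_92_PCT = datetime(2019, 9, 20, 14, 22)
NOTIFIED = datetime(2019, 9, 20, 15, 16)

# Delay components (minutes) as listed in the reply above.
COMPONENTS_MIN = {
    "max check attempts x retry interval": 30,
    "check interval": 5,
    "first notification delay": 5,
    "cron pickup (worst case)": 1,
}


def unexplained_minutes(total_min: float, components: Mapping[str, float]) -> float:
    """Subtract every known delay component from the total gap and return the remainder."""
    return total_min - sum(components.values())


if __name__ == "__main__":
    measured_gap = (NOTIFIED - FS_AT_92_PCT).total_seconds() / 60
    print("quoted gap 52 ->", unexplained_minutes(52, COMPONENTS_MIN))  # quoted gap 52 -> 11
    print(f"measured gap {measured_gap:g} ->",
          unexplained_minutes(measured_gap, COMPONENTS_MIN))          # measured gap 54 -> 13.0
```

Note that 14:22 to 15:16 HKT is 54 minutes rather than 52, which leaves 13 minutes unexplained instead of 11. Removing the check interval entry, which the later support reply retracts, adds another 5 minutes to the residual.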
Re: Delayed FS notification in Nagios XI
___________________________________________________________________
Then the check interval is 5 minutes, so that brings us down to 17 minutes.
The first notification delay is 5 minutes, so that brings us down to 12 minutes.
_________________________________________________________________________
Thanks for the response. We need some clarity on the two points above; please help.
1. If the file system has already been above the defined threshold for 30 mins (poll interval * lag), then what is the logic behind the additional check interval (5 mins in this case), which delays the alert notification by another 5 mins?
2. How do we justify the "5 mins first notification delay", and where is it defined?
Let me know if you need more logs besides the profile file, such as the Nagios logs for the incident date, or anything else to dig further into this case.
Re: Delayed FS notification in Nagios XI
Oh, actually you're correct, the check interval does not add to the notification lag. I was thinking from a perspective of knowing when the file system filled up, which is different from this issue.
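To make that distinction concrete: the check interval only affects how long after the file system fills the first failing check runs (detection latency), while the time from that first failing check to the notification is driven by the retry/attempt settings and any first notification delay. A simplified Python model; the last check time and the 30-minute post-detection lag are hypothetical placeholders, not values taken from this system:

```python
from datetime import datetime, timedelta
from typing import Tuple


def first_check_at_or_after(event: datetime, last_check: datetime, check_interval_min: int) -> datetime:
    """First check on a fixed cadence (starting at last_check) that runs at or after event."""
    step = timedelta(minutes=check_interval_min)
    t = last_check
    while t < event:
        t += step
    return t


def timeline(fill_time: datetime, last_check_before_fill: datetime,
             check_interval_min: int, lag_after_detection_min: int) -> Tuple[timedelta, timedelta]:
    """Return (detection latency, detection-to-notification lag) for this simplified model."""
    detected = first_check_at_or_after(fill_time, last_check_before_fill, check_interval_min)
    notified = detected + timedelta(minutes=lag_after_detection_min)
    return detected - fill_time, notified - detected


if __name__ == "__main__":
    fill = datetime(2019, 9, 20, 14, 32)
    last_ok_check = datetime(2019, 9, 20, 14, 29)  # hypothetical schedule
    for interval in (5, 10):
        latency, lag = timeline(fill, last_ok_check, interval, 30)  # 30 = hypothetical lag
        print(f"check_interval={interval}: detection latency {latency}, lag {lag}")
```

Changing the check interval moves the detection time, but the lag measured from the first failing check stays the same.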
The first notification delay is set in the service settings under the Alert Settings tab. It may also be set in the service template. Go to Configure -> Core Configuration Manager -> Services -> Select the service check in question and click the wrench icon to edit the service -> Click the Alert Settings tab.
If in this page the First Notification Delay field is empty, then it is probably being set at the service template level. Click back to Command Settings and click the Manage Templates button. You should see an Assigned template. For example, xiwizard_ncpa_service. That's the template that is applied.
So in the left menu click Service Templates, find the service template in question and click the wrench icon to edit. Click Alert Settings, and the First Notification Delay field should be set there.
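The lookup described above (use the service's own First Notification Delay if it is set, otherwise fall back to its assigned template) follows Nagios object inheritance. A simplified Python sketch; the object names and values are hypothetical, depth-first order for multiple templates is an assumption, and real Nagios inheritance has additional rules not modelled here. In Nagios Core the value is counted in "time units" of `interval_length`, which is 60 seconds by default, so 5 normally means 5 minutes.

```python
from typing import Any, Dict, Optional, Set

# Hypothetical object definitions (names and values are illustrative only).
# None means the field is left empty on that object, so it is inherited.
OBJECTS: Dict[str, Dict[str, Any]] = {
    "example-fs-service": {"first_notification_delay": None, "use": ["example-fs-template"]},
    "example-fs-template": {"first_notification_delay": 5, "use": ["example-base-template"]},
    "example-base-template": {"first_notification_delay": 0, "use": []},
}


def resolve(name: str, directive: str, objects: Dict[str, Dict[str, Any]],
            _seen: Optional[Set[str]] = None) -> Any:
    """Return the object's own value, else the first value found in its templates (depth-first)."""
    seen = _seen if _seen is not None else set()
    if name in seen:  # guard against accidental template loops
        return None
    seen.add(name)
    obj = objects[name]
    value = obj.get(directive)
    if value is not None:
        return value
    for template in obj.get("use", []):
        inherited = resolve(template, directive, objects, seen)
        if inherited is not None:
            return inherited
    return None


if __name__ == "__main__":
    print(resolve("example-fs-service", "first_notification_delay", OBJECTS))
```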