Nagios XI 5.4.4 Notification Lag?

Post by **emartine** » Thu Nov 02, 2017 11:22 am

I am trying to figure out why mail notifications were sent out ~25 minutes after Nagios showed OK status.

Check interval is 3
Retry interval is 1
Max check attempts is 3
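
As a rough sanity check on these three settings, the sketch below estimates how long a service takes to reach a HARD state after an outage begins. It assumes both intervals are in minutes (the Nagios default time unit) and ignores check execution time and scheduling latency; the function name is made up for illustration.

```python
def minutes_to_hard_state(check_interval, retry_interval, max_attempts):
    """Return (best, worst) minutes from outage start to the HARD state.

    Assumption: the first failing check is attempt 1, and each remaining
    attempt is scheduled one retry interval after the previous one.
    """
    retries = (max_attempts - 1) * retry_interval
    best = retries                      # outage starts just before a scheduled check
    worst = check_interval + retries    # outage starts just after a scheduled check
    return best, worst


best, worst = minutes_to_hard_state(check_interval=3, retry_interval=1, max_attempts=3)
print(f"HARD state expected {best} to {worst} minutes after the outage starts")
```

Under these assumptions the window is 2 to 5 minutes, which fits the first notifications arriving about 3 minutes after the firewall test began.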

The system is running:
Active Host Checks
1-min 642
5-min 2,022
15-min 2,022

Active Service Checks
1-min 5,043
5-min 16,778
15-min 16,779

We were testing a firewall at ~1:40 PM. Users started getting notifications at ~1:43 PM, as expected.
At ~1:48 PM the firewall was turned off.

Nagios history showed the following, also as expected.

| Date / Time | Host | Service | State | State Type | Attempt | Information |
|---|---|---|---|---|---|---|
| 2017-11-01 13:49:56 | <servername> | Disk - C | OK | HARD | 3 of 3 | C:\ - total: 127.00 Gb - used: 15.90 Gb (13%) - free 111.10 Gb (87%) |
| 2017-11-01 13:49:52 | <servername> | Uptime | OK | HARD | 3 of 3 | System Uptime - 92 day(s) 21 hour(s) 3 minute(s) |
| 2017-11-01 13:49:40 | <servername> | MongoDB Connection 27017 | OK | HARD | 3 of 3 | OK - State: 7 (Arbiter on port 27017) |
| 2017-11-01 13:49:35 | <servername> | Eventlog | OK | HARD | 3 of 3 | Eventlog: Started |
| 2017-11-01 13:48:58 | <servername> | Disk - F | OK | HARD | 3 of 3 | F:\ - total: 1022.87 Gb - used: 10.78 Gb (1%) - free 1012.09 Gb (99%) |
| 2017-11-01 13:48:53 | <servername> | Disk - G | OK | HARD | 3 of 3 | G:\ - total: 1022.87 Gb - used: 0.89 Gb (0%) - free 1021.99 Gb (100%) |
| 2017-11-01 13:48:47 | <servername> | Disk - D | OK | HARD | 3 of 3 | D:\ - total: 14.00 Gb - used: 1.19 Gb (8%) - free 12.81 Gb (92%) |
| 2017-11-01 13:48:47 | <servername> | CPU Load | OK | HARD | 3 of 3 | CPU Load 0% (80 min average) 0% (180 min average) 0% (1440 min average) |
| 2017-11-01 13:48:38 | <servername> | Memory Usage | OK | HARD | 3 of 3 | Memory usage: total:8319.62 MB - used: 1746.58 MB (21%) - free: 6573.04 MB (79%) |
| 2017-11-01 13:45:42 | <servername> | Memory Usage | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds |
| 2017-11-01 13:44:13 | <servername> | MongoDB Connection 27017 | CRITICAL | HARD | 3 of 3 | CRITICAL - Connection to Mongo server on <serverip:port> has failed |
| 2017-11-01 13:44:09 | <servername> | Disk - C | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds |
| 2017-11-01 13:44:05 | <servername> | Uptime | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds |
| 2017-11-01 13:43:47 | <servername> | Eventlog | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds |
| 2017-11-01 13:43:11 | <servername> | Disk - F | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds |
| 2017-11-01 13:43:06 | <servername> | Disk - G | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds |
| 2017-11-01 13:42:59 | <servername> | Disk - D | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds |
| 2017-11-01 13:42:59 | <servername> | CPU Load | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds |

Nagios was still sending CRITICAL email notifications for services up until 2:11 PM. See the timestamp below.

***** Nagios XI Alert *****

Nagios has detected a problem with this service.

Notification Type: PROBLEM

Service: MongoDB Connection
Host: <servername>
Address: <server ip>
State: CRITICAL
Info:
CRITICAL - Connection to Mongo server on <serverandport> has failed
Date/Time: 2017-11-01 14:11:07

OK Notifications were not sent out until after 2:12PM and that continued until 2:40PM.

***** Nagios XI Alert *****

Nagios has detected this service has recovered.

Notification Type: RECOVERY

Service: Eventlog
Host: <servername>
Address: <serverip>
State: OK
Info:
Eventlog: Started
Date/Time: 2017-11-01 14:39:49

Why is there a lag of about 25 minutes?
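
To put numbers on the lag, the sketch below subtracts the state history timestamps from the email timestamps quoted above. It assumes the "MongoDB Connection" email refers to the "MongoDB Connection 27017" service in the history and that both sets of timestamps come from the same clock; the helper name is illustrative only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def lag(state_change, email_stamp):
    """Return the time between a logged state change and an email timestamp."""
    return datetime.strptime(email_stamp, FMT) - datetime.strptime(state_change, FMT)


# MongoDB Connection: history shows OK at 13:49:40, yet a PROBLEM email is stamped 14:11:07.
mongo_lag = lag("2017-11-01 13:49:40", "2017-11-01 14:11:07")

# Eventlog: history shows OK at 13:49:35, and the RECOVERY email is stamped 14:39:49.
eventlog_lag = lag("2017-11-01 13:49:35", "2017-11-01 14:39:49")

print("PROBLEM email after recovery:", mongo_lag)      # 0:21:27
print("RECOVERY email after recovery:", eventlog_lag)  # 0:50:14
```

So the last quoted PROBLEM email trails the logged recovery by about 21 minutes, and the quoted RECOVERY email trails it by about 50 minutes.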

Post by **dwhitfield** » Thu Nov 02, 2017 4:16 pm

If everything went down, it could have just been working through things. In older versions everything happened at once and this was not good for performance, so now things are scheduled.

My suggestion, without knowing more, would be to set up some parent-child relationships so XI isn't flooded when something like this happens. What those might be would be up to you. You could also try spreading out the check frequency for things that aren't as important. That way, the important notifications go out first.

You could give the machine more resources, so that it can work through the backlog more quickly. For example, do you have a ramdisk? These are absolutely essential on a large system: https://assets.nagios.com/downloads/nag ... giosXI.pdf

I would not suggest turning off the scheduler, as there is a reason it is set as the default, but ultimately, I suppose that is an option.

That all said, I know very little about how your notifications are set up at this point. Can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.

You can also generate a profile manually using the script at /usr/local/nagiosxi/html/includes/components/profile/getprofile.sh

That should generate a profile in /usr/local/nagiosxi/var/components/ which you can get off the server with an application such as FileZilla.

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.

If you get an error that PROFILE BUILD FAILED, please see https://support.nagios.com/kb/article.p ... ategory=44

UPDATE: profile shared with techs

Post by **emartine** » Thu Nov 02, 2017 4:57 pm

PM profile sent. The disk that nagiosxi sits on is a FIO card.

Post by **tacolover101** » Thu Nov 02, 2017 8:05 pm

it would be awesome to have a better way for the profile method, so others in the community could also help. (perhaps sanitized? hosted?)

@emartine - does your notification log in XI show the proper timing of it being sent? if so, please post your objects.cache. i would suspect there is either timing at a host / service level, or an escalation that someone may have created.

you are comparing mongodb and eventlog which are two separate services, so it's possible they are configured differently somewhere.
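
One way to compare how two services are configured is to pull the notification-related directives out of objects.cache. The sketch below is only an illustration: it assumes the usual `define service { ... }` block layout with one whitespace-separated directive per line, and the sample host, services, and values are hypothetical, not taken from this system.

```python
NOTIFY_KEYS = (
    "notification_interval",
    "first_notification_delay",
    "notification_period",
    "notifications_enabled",
)


def notification_settings(cache_text):
    """Return (host, service, settings) for each 'define service' block."""
    results = []
    block = None
    for raw in cache_text.splitlines():
        line = raw.strip()
        if line.endswith("{") and line[:-1].strip() == "define service":
            block = {}                      # start of a service definition
        elif line == "}" and block is not None:
            settings = {k: block[k] for k in NOTIFY_KEYS if k in block}
            results.append((block.get("host_name"),
                            block.get("service_description"), settings))
            block = None
        elif block is not None and line:
            parts = line.split(None, 1)     # directive name, then its value
            if len(parts) == 2:
                block[parts[0]] = parts[1]
    return results


# Hypothetical sample; a real objects.cache holds many more directives.
SAMPLE = """
define service {
    host_name db01
    service_description MongoDB Connection 27017
    notification_interval 60
    first_notification_delay 0
    }
define service {
    host_name db01
    service_description Eventlog
    notification_interval 60
    first_notification_delay 20
    }
"""

for host, service, settings in notification_settings(SAMPLE):
    print(host, service, settings)
```

Differences in `first_notification_delay` or `notification_interval` between services would be one place to look; escalations live in separate `define serviceescalation` blocks, which this sketch skips.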

Post by **tgriep** » Fri Nov 03, 2017 11:16 am

The firewall was turned off at 1:48, when was it turned on?
Can you run a Notification report for the Service called MongoDB Connection from your example, export it as a CSV file and post that here?

Post by **emartine** » Thu Nov 09, 2017 2:21 pm

This isn't just mongo DB. All of the services for the hosts that were behind the firewall responded in the same manner.

Post by **tgriep** » Thu Nov 09, 2017 5:43 pm

I understand that more services were having the same issue.
I just wanted the report run for that one service so it would be shorter and easier to search.

Post by **emartine** » Mon Nov 13, 2017 10:19 am

Today I just got a message from one of our networking guys. He disabled host notifications but still received an email notification for it.

Post by **emartine** » Mon Nov 13, 2017 10:40 am

I PM'd you the CSV report for one host.

Post by **tgriep** » Mon Nov 13, 2017 10:55 am

I received the State History report and shared it with the other techs.
I took a look at the report and it doesn't show a delayed DOWN or UP state that would cause the delayed emails.
The only thing I can think of is that, at that time, there were duplicate nagios processes, and that could cause what you are seeing.
With as many host and service checks as your server is running, that could be the issue.

To fix this, you can read this KB article which will allow the nagios process more time to save the settings and not spawn a duplicate process.
https://support.nagios.com/kb/article/n ... anner.html

Try that and see if this helps.

The issue with your network guy could be the same thing: a duplicated nagios process. Make sure there is only one parent and one child nagios process running.
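
A minimal sketch of that check follows, assuming process listings in the form printed by `ps -eo pid,ppid,args` and a daemon started with the `-d` flag. The binary and config paths in the sample are the usual source-install locations and are assumptions here, as are the PIDs. More than one "root" daemon (a daemon whose parent is not itself a nagios daemon) would suggest a duplicate instance.

```python
def nagios_daemon_roots(ps_output):
    """Return PIDs of nagios daemons whose parent is not another nagios daemon."""
    daemons = {}
    for line in ps_output.strip().splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue                        # skip the header or malformed lines
        pid, ppid, argv = int(fields[0]), int(fields[1]), fields[2].split()
        # Count only the daemon itself, not its "--worker" helper processes.
        if argv[0].endswith("/nagios") and "-d" in argv:
            daemons[pid] = ppid
    return sorted(pid for pid, ppid in daemons.items() if ppid not in daemons)


# Hypothetical listing showing a duplicate daemon (PID 2977).
SAMPLE_PS = """
  PID  PPID ARGS
 1201     1 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
 1203  1201 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
 1210  1201 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
 2977     1 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
"""

roots = nagios_daemon_roots(SAMPLE_PS)
if len(roots) > 1:
    print("Possible duplicate nagios daemons, root PIDs:", roots)
else:
    print("Single nagios daemon tree, root PID:", roots)
```

On a live server the listing could come from `subprocess.run(["ps", "-eo", "pid,ppid,args"], capture_output=True, text=True).stdout` instead of the sample string.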
