Nagios XI 5.4.4 Notification Lag?
Posted: Thu Nov 02, 2017 11:22 am
I am trying to figure out why mail notifications were sent out ~25 minutes after Nagios showed OK status.
Check interval is 3
Retry interval is 1
Max check attempts is 3
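For reference, here is a rough sketch of how long these three settings should take to produce a HARD state. It assumes the default Nagios interval unit of one minute (interval_length=60) and ignores check timeouts and scheduling latency. The function name is made up for illustration and is not part of Nagios.

```python
def minutes_to_hard_state(check_interval, retry_interval, max_check_attempts):
    """Estimate minutes from a failure to a HARD state (illustrative only)."""
    # Attempt 1 is the first failed check; the remaining attempts are
    # rescheduled at retry_interval until max_check_attempts is reached.
    after_first_failed_check = (max_check_attempts - 1) * retry_interval
    # Worst case: the outage begins right after a successful check, so a full
    # check_interval passes before the first failed check is even run.
    worst_case_after_outage = check_interval + after_first_failed_check
    return after_first_failed_check, worst_case_after_outage


best, worst = minutes_to_hard_state(check_interval=3, retry_interval=1,
                                    max_check_attempts=3)
print(f"HARD state expected {best} to {worst} minutes after the outage starts")
```

With 3/1/3 that gives a window of roughly 2 to 5 minutes, so these settings alone would not account for a delay of 25 minutes or more.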
The system is running:
Active Host Checks
1-min 642
5-min 2,022
15-min 2,022
Active Service Checks
1-min 5,043
5-min 16,778
15-min 16,779
We were testing a firewall at ~1:40 PM. Users started getting notifications at ~1:43 PM, as expected.
At ~1:48 PM the firewall was turned off.
Nagios history showed the entries below, also as expected.
Date / Time | Host | Service | State | State Type | Attempt | Information
2017-11-01 13:49:56 | <servername> | Disk - C | OK | HARD | 3 of 3 | C:\ - total: 127.00 Gb - used: 15.90 Gb (13%) - free 111.10 Gb (87%)
2017-11-01 13:49:52 | <servername> | Uptime | OK | HARD | 3 of 3 | System Uptime - 92 day(s) 21 hour(s) 3 minute(s)
2017-11-01 13:49:40 | <servername> | MongoDB Connection 27017 | OK | HARD | 3 of 3 | OK - State: 7 (Arbiter on port 27017)
2017-11-01 13:49:35 | <servername> | Eventlog | OK | HARD | 3 of 3 | Eventlog: Started
2017-11-01 13:48:58 | <servername> | Disk - F | OK | HARD | 3 of 3 | F:\ - total: 1022.87 Gb - used: 10.78 Gb (1%) - free 1012.09 Gb (99%)
2017-11-01 13:48:53 | <servername> | Disk - G | OK | HARD | 3 of 3 | G:\ - total: 1022.87 Gb - used: 0.89 Gb (0%) - free 1021.99 Gb (100%)
2017-11-01 13:48:47 | <servername> | Disk - D | OK | HARD | 3 of 3 | D:\ - total: 14.00 Gb - used: 1.19 Gb (8%) - free 12.81 Gb (92%)
2017-11-01 13:48:47 | <servername> | CPU Load | OK | HARD | 3 of 3 | CPU Load 0% (80 min average) 0% (180 min average) 0% (1440 min average)
2017-11-01 13:48:38 | <servername> | Memory Usage | OK | HARD | 3 of 3 | Memory usage: total:8319.62 MB - used: 1746.58 MB (21%) - free: 6573.04 MB (79%)
2017-11-01 13:45:42 | <servername> | Memory Usage | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds
2017-11-01 13:44:13 | <servername> | MongoDB Connection 27017 | CRITICAL | HARD | 3 of 3 | CRITICAL - Connection to Mongo server on <serverip:port> has failed
2017-11-01 13:44:09 | <servername> | Disk - C | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds
2017-11-01 13:44:05 | <servername> | Uptime | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds
2017-11-01 13:43:47 | <servername> | Eventlog | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds
2017-11-01 13:43:11 | <servername> | Disk - F | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds
2017-11-01 13:43:06 | <servername> | Disk - G | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds
2017-11-01 13:42:59 | <servername> | Disk - D | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds
2017-11-01 13:42:59 | <servername> | CPU Load | CRITICAL | HARD | 3 of 3 | CRITICAL - Socket timeout after 10 seconds
Critical email notifications were still being sent by Nagios for services up until 2:11 PM. See the timestamp in this example:
***** Nagios XI Alert *****
Nagios has detected a problem with this service.
Notification Type: PROBLEM
Service: MongoDB Connection
Host: <servername>
Address: <server ip>
State: CRITICAL
Info:
CRITICAL - Connection to Mongo server on <serverandport> has failed
Date/Time: 2017-11-01 14:11:07
OK notifications were not sent out until after 2:12 PM, and they continued until 2:40 PM. For example:
***** Nagios XI Alert *****
Nagios has detected this service has recovered.
Notification Type: RECOVERY
Service: Eventlog
Host: <servername>
Address: <serverip>
State: OK
Info:
Eventlog: Started
Date/Time: 2017-11-01 14:39:49
Why is there a lag of ~25 minutes or more?
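To put numbers on the lag, here is a small sketch that subtracts the state-history timestamps from the Date/Time shown in the two sample emails above. It assumes the email Date/Time is when Nagios generated the notification, and that the emailed "MongoDB Connection" service is the same one listed as "MongoDB Connection 27017" in the history. The helper name is only illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def lag_minutes(state_change, notification):
    """Minutes between a HARD state change and the notification timestamp."""
    delta = datetime.strptime(notification, FMT) - datetime.strptime(state_change, FMT)
    return delta.total_seconds() / 60


# MongoDB Connection: CRITICAL HARD in history vs. PROBLEM email Date/Time
mongo_lag = lag_minutes("2017-11-01 13:44:13", "2017-11-01 14:11:07")
# Eventlog: OK HARD in history vs. RECOVERY email Date/Time
eventlog_lag = lag_minutes("2017-11-01 13:49:35", "2017-11-01 14:39:49")

print(f"MongoDB PROBLEM lag:   {mongo_lag:.1f} min")
print(f"Eventlog RECOVERY lag: {eventlog_lag:.1f} min")
```

For these two samples the gap works out to about 27 minutes for the PROBLEM email and about 50 minutes for the RECOVERY email. Note that the PROBLEM email is dated roughly 21 minutes after the same service had already returned to OK (13:49:40) in the history.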