Nagios timeout issue causing bulk (false) alerts...
Posted: Fri Nov 02, 2018 4:25 am
Hi Nagios Team,
Below are the system settings:
Nagios Core 4.1.1
RHEL 6.10 (Santiago)
vCPU - 4
Ram - 12GB
Of late we have noticed a major increase in bulk alerts caused by check timeouts. This is causing major issues for us, as it sends out bulk (false) alerts even though the services recover soon afterwards.
Please assist.
Regards,
Vish.
Below is a snippet from the Nagios log:
Code:
[1540881446] Warning: Check of service 'mysql Node 2 System Disk Ephemeral Percent' on host 'cf-70063fb6c9cd84d8b810' timed out after 239.156s!
[1540881446] SERVICE ALERT: cf-70063fb6c9cd84d8b810;mysql Node 2 System Disk Ephemeral Percent;CRITICAL;SOFT;1;(Service check timed out after 239.16 seconds)
[1540881446] SERVICE EVENT HANDLER: cf-70063fb6c9cd84d8b810;mysql Node 2 System Disk Ephemeral Percent;CRITICAL;SOFT;1;send_svc_snmptrap
[1540881446] wproc: Core Worker 4866: job 353649 (pid=19816) timed out. Killing it
[1540881446] wproc: CHECK job 353649 from worker Core Worker 4866 timed out after 239.80s
[1540881446] wproc: host=p-mysql-d62aed61ea3264a67a8f; service=backup-prepare Node 0 System Disk System Percent;
[1540881446] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1540881447] Warning: Check of service 'backup-prepare Node 0 System Disk System Percent' on host 'p-mysql-d62aed61ea3264a67a8f' timed out after 239.799s!
[1540881447] SERVICE ALERT: p-mysql-d62aed61ea3264a67a8f;backup-prepare Node 0 System Disk System Percent;CRITICAL;SOFT;1;(Service check timed out after 239.80 seconds)
[1540881447] SERVICE EVENT HANDLER: p-mysql-d62aed61ea3264a67a8f;backup-prepare Node 0 System Disk System Percent;CRITICAL;SOFT;1;send_svc_snmptrap
[1540881447] wproc: Core Worker 4866: job 353650 (pid=19823) timed out. Killing it
[1540881447] wproc: CHECK job 353650 from worker Core Worker 4866 timed out after 240.46s
[1540881447] wproc: host=cf-70063fb6c9cd84d8b810; service=syslog_scheduler Node 1 System Disk System Percent;
[1540881447] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1540881447] Warning: Check of service 'syslog_scheduler Node 1 System Disk System Percent' on host 'cf-70063fb6c9cd84d8b810' timed out after 240.456s!
[1540881447] SERVICE ALERT: cf-70063fb6c9cd84d8b810;syslog_scheduler Node 1 System Disk System Percent;CRITICAL;SOFT;1;(Service check timed out after 240.46 seconds)
[1540881447] SERVICE EVENT HANDLER: cf-70063fb6c9cd84d8b810;syslog_scheduler Node 1 System Disk System Percent;CRITICAL;SOFT;1;send_svc_snmptrap
[1540881447] wproc: Core Worker 4866: job 353651 (pid=19832) timed out. Killing it
[1540881447] wproc: CHECK job 353651 from worker Core Worker 4866 timed out after 241.35s
[1540881447] wproc: host=cf-70063fb6c9cd84d8b810; service=diego_cell Node 17 System Disk Ephemeral Inode_percent;
[1540881447] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1540881447] Warning: Check of service 'diego_cell Node 17 System Disk Ephemeral Inode_percent' on host 'cf-70063fb6c9cd84d8b810' timed out after 241.351s!
[1540881447] SERVICE ALERT: cf-70063fb6c9cd84d8b810;diego_cell Node 17 System Disk Ephemeral Inode_percent;CRITICAL;SOFT;1;(Service check timed out after 241.35 seconds)
[1540881447] SERVICE EVENT HANDLER: cf-70063fb6c9cd84d8b810;diego_cell Node 17 System Disk Ephemeral Inode_percent;CRITICAL;SOFT;1;send_svc_snmptrap
[1540881447] wproc: Core Worker 4866: job 353652 (pid=19847) timed out. Killing it
[1540881447] wproc: CHECK job 353652 from worker Core Worker 4866 timed out after 242.00s
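To get a feel for how widespread the timeouts are, a rough sketch like the one below could tally the "timed out" warnings per host. The regular expression is only inferred from the log lines above, and the function names are made up for illustration, so treat it as a starting point rather than a finished tool.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Pattern inferred from the "Warning: Check of service ... timed out" lines
# in the snippet above (an assumption, not an official log format spec).
TIMEOUT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] Warning: Check of service '(?P<service>.+)' "
    r"on host '(?P<host>.+)' timed out after (?P<secs>[\d.]+)s!$"
)


def parse_timeouts(lines):
    """Yield (utc_datetime, host, service, seconds) for each timeout warning."""
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m:
            # The bracketed number is a Unix epoch timestamp.
            when = datetime.fromtimestamp(int(m.group("ts")), tz=timezone.utc)
            yield when, m.group("host"), m.group("service"), float(m.group("secs"))


def timeouts_per_host(lines):
    """Count timeout warnings per host (hypothetical helper)."""
    return Counter(host for _, host, _, _ in parse_timeouts(lines))


if __name__ == "__main__":
    sample = [
        "[1540881446] Warning: Check of service 'mysql Node 2 System Disk "
        "Ephemeral Percent' on host 'cf-70063fb6c9cd84d8b810' timed out after 239.156s!",
        "[1540881446] wproc: Core Worker 4866: job 353649 (pid=19816) timed out. Killing it",
    ]
    for host, count in timeouts_per_host(sample).most_common():
        print(host, count)
```

Feeding it the whole log file (for example `timeouts_per_host(open(path))`, with `path` pointing at wherever the log lives on the server) should show whether the timeouts are spread across all hosts or concentrated on a few.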
We also noticed extremely high load averages:
Code:
[Tue Oct 30 00:00:00 2018] CURRENT SERVICE STATE: localhost;Current Load;CRITICAL;HARD;4;CRITICAL - load average: 50.55, 40.52, 48.08
[Tue Oct 30 00:18:15 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 132.55, 66.90, 59.83
[Tue Oct 30 01:18:50 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 13.06, 30.32, 44.36
[Tue Oct 30 02:21:57 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 74.63, 33.25, 40.64
[Tue Oct 30 03:22:13 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 123.50, 58.17, 83.74
[Tue Oct 30 04:26:01 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 28.82, 30.08, 41.59
[Tue Oct 30 05:27:37 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 36.80, 64.63, 66.34
[Tue Oct 30 06:31:36 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 1.11, 33.49, 54.44
[Tue Oct 30 07:35:34 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 11.05, 61.68, 66.50
[Tue Oct 30 08:37:28 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 5.88, 81.14, 86.19
[Tue Oct 30 09:39:32 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 0.81, 19.89, 33.41
[Tue Oct 30 10:43:22 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 4.39, 35.15, 79.20
[Tue Oct 30 11:45:04 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 54.23, 65.02, 81.51
[Tue Oct 30 12:45:31 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 1125.57, 1018.16, 587.02
[Tue Oct 30 13:47:55 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 0.78, 195.91, 701.50
[Tue Oct 30 15:28:15 2018] SERVICE ALERT: localhost;Current Load;OK;HARD;4;OK - load average: 0.01, 0.02, 2.14
[Tue Oct 30 15:28:15 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;OK;notify-service-by-email;OK - load average: 0.01, 0.02, 2.14
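For context, the load averages above can be put next to the 4 vCPUs of this box. The sketch below simply divides each reported value by the CPU count; the regular expression is inferred from the plugin output above and `load_per_cpu` is a made-up helper name.

```python
import re

# Pattern inferred from the "load average: a, b, c" text in the lines above.
LOAD_RE = re.compile(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)")


def load_per_cpu(line, cpus=4):
    """Return the 1, 5 and 15 minute load averages divided by the CPU count.

    Returns None when the line carries no load average. The default of 4
    matches the vCPU count listed in the system settings.
    """
    m = LOAD_RE.search(line)
    if m is None:
        return None
    return tuple(float(value) / cpus for value in m.groups())


if __name__ == "__main__":
    line = (
        "[Tue Oct 30 00:18:15 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;"
        "Current Load;CRITICAL;notify-service-by-email;"
        "CRITICAL - load average: 132.55, 66.90, 59.83"
    )
    print(load_per_cpu(line))
```

A per-CPU value well above 1 means far more runnable or blocked processes than the CPUs can serve, which would be consistent with checks queueing up until they hit the timeout.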
Code:
/etc/sysctl.conf
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
> ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 44997
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
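Since the log shows worker processes running each check as a separate job, the "max user processes" and "open files" values of 1024 may be worth comparing against what the nagios user actually gets at runtime. Whether these limits are related to the timeouts is only a guess at this point. The sketch below uses Python's standard `resource` module to print the soft limits of whichever user runs it; `soft_limits` and `describe` are made-up helper names.

```python
import resource


def soft_limits():
    """Return the soft limits for processes and open files of the current user."""
    nproc_soft, _ = resource.getrlimit(resource.RLIMIT_NPROC)
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {"max user processes": nproc_soft, "open files": nofile_soft}


def describe(value):
    """Render a limit the way `ulimit -a` does, with 'unlimited' for no limit."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    # Run this as the user that owns the Nagios process to see its limits.
    for name, value in soft_limits().items():
        print(f"{name}: {describe(value)}")
```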