Nagios timeout issue causing bulk (false) alerts...

Post by **vishfx** » Fri Nov 02, 2018 4:25 am

Hi Nagios Team,

Nagios Core 4.1.1
RHEL 6.10 (Santiago)
vCPU - 4
Ram - 12GB

Of late we have noticed a major increase in bulk alerts due to timeouts. This is causing major issues for us, as it sends out bulk (false) alerts even though the services recover soon afterwards.

Please assist.

Regards,
Vish.

Below is the snippet from nagios log:

[1540881446] Warning: Check of service 'mysql Node 2 System Disk Ephemeral Percent' on host 'cf-70063fb6c9cd84d8b810' timed out after 239.156s!
[1540881446] SERVICE ALERT: cf-70063fb6c9cd84d8b810;mysql Node 2 System Disk Ephemeral Percent;CRITICAL;SOFT;1;(Service check timed out after 239.16 seconds)
[1540881446] SERVICE EVENT HANDLER: cf-70063fb6c9cd84d8b810;mysql Node 2 System Disk Ephemeral Percent;CRITICAL;SOFT;1;send_svc_snmptrap
[1540881446] wproc: Core Worker 4866: job 353649 (pid=19816) timed out. Killing it
[1540881446] wproc: CHECK job 353649 from worker Core Worker 4866 timed out after 239.80s
[1540881446] wproc:   host=p-mysql-d62aed61ea3264a67a8f; service=backup-prepare Node 0 System Disk System Percent;
[1540881446] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1540881447] Warning: Check of service 'backup-prepare Node 0 System Disk System Percent' on host 'p-mysql-d62aed61ea3264a67a8f' timed out after 239.799s!
[1540881447] SERVICE ALERT: p-mysql-d62aed61ea3264a67a8f;backup-prepare Node 0 System Disk System Percent;CRITICAL;SOFT;1;(Service check timed out after 239.80 seconds)
[1540881447] SERVICE EVENT HANDLER: p-mysql-d62aed61ea3264a67a8f;backup-prepare Node 0 System Disk System Percent;CRITICAL;SOFT;1;send_svc_snmptrap
[1540881447] wproc: Core Worker 4866: job 353650 (pid=19823) timed out. Killing it
[1540881447] wproc: CHECK job 353650 from worker Core Worker 4866 timed out after 240.46s
[1540881447] wproc:   host=cf-70063fb6c9cd84d8b810; service=syslog_scheduler Node 1 System Disk System Percent;
[1540881447] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1540881447] Warning: Check of service 'syslog_scheduler Node 1 System Disk System Percent' on host 'cf-70063fb6c9cd84d8b810' timed out after 240.456s!
[1540881447] SERVICE ALERT: cf-70063fb6c9cd84d8b810;syslog_scheduler Node 1 System Disk System Percent;CRITICAL;SOFT;1;(Service check timed out after 240.46 seconds)
[1540881447] SERVICE EVENT HANDLER: cf-70063fb6c9cd84d8b810;syslog_scheduler Node 1 System Disk System Percent;CRITICAL;SOFT;1;send_svc_snmptrap
[1540881447] wproc: Core Worker 4866: job 353651 (pid=19832) timed out. Killing it
[1540881447] wproc: CHECK job 353651 from worker Core Worker 4866 timed out after 241.35s
[1540881447] wproc:   host=cf-70063fb6c9cd84d8b810; service=diego_cell Node 17 System Disk Ephemeral Inode_percent;
[1540881447] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1540881447] Warning: Check of service 'diego_cell Node 17 System Disk Ephemeral Inode_percent' on host 'cf-70063fb6c9cd84d8b810' timed out after 241.351s!
[1540881447] SERVICE ALERT: cf-70063fb6c9cd84d8b810;diego_cell Node 17 System Disk Ephemeral Inode_percent;CRITICAL;SOFT;1;(Service check timed out after 241.35 seconds)
[1540881447] SERVICE EVENT HANDLER: cf-70063fb6c9cd84d8b810;diego_cell Node 17 System Disk Ephemeral Inode_percent;CRITICAL;SOFT;1;send_svc_snmptrap
[1540881447] wproc: Core Worker 4866: job 353652 (pid=19847) timed out. Killing it
[1540881447] wproc: CHECK job 353652 from worker Core Worker 4866 timed out after 242.00s

Also noticed extremely high load averages:

[Tue Oct 30 00:00:00 2018] CURRENT SERVICE STATE: localhost;Current Load;CRITICAL;HARD;4;CRITICAL - load average: 50.55, 40.52, 48.08
[Tue Oct 30 00:18:15 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 132.55, 66.90, 59.83
[Tue Oct 30 01:18:50 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 13.06, 30.32, 44.36
[Tue Oct 30 02:21:57 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 74.63, 33.25, 40.64
[Tue Oct 30 03:22:13 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 123.50, 58.17, 83.74
[Tue Oct 30 04:26:01 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 28.82, 30.08, 41.59
[Tue Oct 30 05:27:37 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 36.80, 64.63, 66.34
[Tue Oct 30 06:31:36 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 1.11, 33.49, 54.44
[Tue Oct 30 07:35:34 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 11.05, 61.68, 66.50
[Tue Oct 30 08:37:28 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 5.88, 81.14, 86.19
[Tue Oct 30 09:39:32 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 0.81, 19.89, 33.41
[Tue Oct 30 10:43:22 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 4.39, 35.15, 79.20
[Tue Oct 30 11:45:04 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 54.23, 65.02, 81.51
[Tue Oct 30 12:45:31 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 1125.57, 1018.16, 587.02
[Tue Oct 30 13:47:55 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;CRITICAL;notify-service-by-email;CRITICAL - load average: 0.78, 195.91, 701.50
[Tue Oct 30 15:28:15 2018] SERVICE ALERT: localhost;Current Load;OK;HARD;4;OK - load average: 0.01, 0.02, 2.14
[Tue Oct 30 15:28:15 2018] SERVICE NOTIFICATION: nagiosadmin;localhost;Current Load;OK;notify-service-by-email;OK - load average: 0.01, 0.02, 2.14

Below are the system settings:

/etc/sysctl.conf

kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296

> ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 44997
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Post by **cdienger** » Fri Nov 02, 2018 1:24 pm

How many hosts and services are configured? What are the specs of the server in terms of number of CPUs and memory? Have there been any recent changes (new configs, hosts or services added, etc.)?

My first thought would be to gather some basic information about what is running when the high load occurs. This can be done with the help of event handlers: https://assets.nagios.com/downloads/nag ... dlers.html

Some good information to gather would be the output of these commands:

top -n 1
ps aux
tail -n 500 /var/log/messages
ipcs -a
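
As a rough illustration of the event handler idea, the commands above could be wrapped in a small script attached to the Current Load service. This is only a sketch under assumptions: the function name `capture_diagnostics`, the output directory, and the argument order (`$SERVICESTATE$`, `$SERVICESTATETYPE$`, optional directory) are made up for the example and are not from this thread. `top` is run with `-b` (batch mode) because an event handler has no terminal, and the nagios user may not be allowed to read /var/log/messages on every system.

```shell
#!/bin/bash
# Hypothetical event handler: snapshot system state when the load check is
# confirmed CRITICAL. Assumed invocation from a Nagios command definition:
#   capture_load_diag.sh $SERVICESTATE$ $SERVICESTATETYPE$ [output_dir]

capture_diagnostics() {
    local state="$1"
    local state_type="$2"
    local out_dir="${3:-/tmp/nagios-load-diag}"   # assumed location

    # Do nothing unless the state is a HARD CRITICAL.
    if [ "$state" != "CRITICAL" ] || [ "$state_type" != "HARD" ]; then
        return 0
    fi

    mkdir -p "$out_dir" || return 1
    local out_file
    out_file="$out_dir/diag-$(date +%Y%m%d-%H%M%S).log"

    # Each command is allowed to fail (missing tool, unreadable log)
    # without aborting the rest of the snapshot.
    {
        echo "=== top ===";      top -b -n 1 2>&1 || true
        echo "=== ps aux ===";   ps aux 2>&1 || true
        echo "=== messages ==="; tail -n 500 /var/log/messages 2>&1 || true
        echo "=== ipcs ===";     ipcs -a 2>&1 || true
    } > "$out_file"
    return 0
}

# Run only when called with the state arguments.
if [ "$#" -ge 2 ]; then
    capture_diagnostics "$@"
fi
```

Each confirmed CRITICAL then leaves one timestamped file behind, which can be compared against the load spikes in the Nagios log.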

Post by **vishfx** » Mon Nov 05, 2018 8:06 am

Below is the VM spec:

Nagios Core 4.1.1
RHEL 6.10 (Santiago)
vCPU - 4
Ram - 12GB

It is monitoring 86 hosts and 2054 services.

Each service has an event handler configured. When the service goes critical, this event handler sends an SNMP trap to Service-Now, which creates an incident.

In this case we observed that network issues caused multiple services to time out, which in turn sent SNMP traps to Service-Now.

This resulted in the creation of 500+ tickets.

Please advise how this can be prevented in future.

Regards,
Vish.

Post by **cdienger** » Mon Nov 05, 2018 2:03 pm

You can define parent and child host relationships and configure the child hosts not to send notifications, or traps via the event handler, while the parent is down:

https://assets.nagios.com/downloads/nag ... ility.html
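
On the event handler side, one possible approach is a thin wrapper that decides whether a trap should go out at all. The sketch below is an assumption-laden example, not the poster's actual handler: `should_send_trap`, the `TRAP_CMD` path, and the argument order are hypothetical. It relies on the standard macros `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$HOSTSTATE$` being passed in by the command definition. The log excerpt earlier in the thread shows traps firing on `CRITICAL;SOFT;1`, so waiting for a HARD state and skipping services whose host is DOWN or UNREACHABLE would likely cut the ticket volume during a network outage.

```shell
#!/bin/bash
# Hypothetical wrapper around the existing trap sender. Assumed invocation:
#   trap_wrapper.sh $SERVICESTATE$ $SERVICESTATETYPE$ $HOSTSTATE$ [args for the real trap script]

# Assumed path; point this at wherever send_svc_snmptrap really lives.
TRAP_CMD="${TRAP_CMD:-/usr/local/nagios/libexec/eventhandlers/send_svc_snmptrap}"

# Returns 0 (send) only for a confirmed problem on a host that is itself UP.
should_send_trap() {
    local service_state="$1"
    local state_type="$2"
    local host_state="$3"

    # SOFT states are still being retried; wait for HARD.
    [ "$state_type" = "HARD" ] || return 1

    # A DOWN or UNREACHABLE host (e.g. its parent is down) explains the
    # service failure, so do not raise one ticket per service.
    [ "$host_state" = "UP" ] || return 1

    # Only CRITICAL raises a trap in this sketch.
    [ "$service_state" = "CRITICAL" ] || return 1

    return 0
}

# Run only when called with the three state arguments.
if [ "$#" -ge 3 ]; then
    if should_send_trap "$1" "$2" "$3"; then
        shift 3
        "$TRAP_CMD" "$@"
    fi
fi
```

With parents defined, services on hosts behind a failed link see `$HOSTSTATE$` as UNREACHABLE once the host check catches up, and the wrapper stays silent for them.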
