High load average with 5.3.3?

Posted: Wed Nov 30, 2016 1:46 pm
by cbeattie-unitrends
Has anyone noticed a high load average after upgrading to Nagios XI 5.3.3? I upgraded two Nagios XI instances on November 25th, and since then the load average on both has gone way up. I updated the OS around the same time, but I've already rebooted back to the previous Linux kernel with no effect.

Is it safe to try installing 5.3.2 over 5.3.3? I have a pre-5.3.3 snapshot I can revert to, but I'd rather not lose the graph data.

The Nagios hosts have 8 CPUs with 32GB of RAM, running CentOS 7. One has 700+ hosts and almost 16K services, and the other has 600+ hosts and 13.5K services.

Re: High load average with 5.3.3?

Posted: Wed Nov 30, 2016 3:22 pm
by bwallace
To the best of my knowledge, we have not had any reports of this behavior. Around the time of the spike, what is being recorded in the event log?

Home > Monitoring Process > Event Log

During the high load/CPU usage, can you post the output of:

Code: Select all

top
ps -aef
From a tech support perspective, it is not recommended to install an older version over a newer version, but I'm sure there are admins out there who have; perhaps they can chime in with some advice about it.

Re: High load average with 5.3.3?

Posted: Thu Dec 01, 2016 9:38 am
by cbeattie-unitrends
I see a lot of runtime warnings and errors in the event log:

Code: Select all

Information	11/27/2016 22:09	wproc: Core Worker 20378: job 464598 (pid=17919) timed out. Killing it
Service Warning	11/27/2016 22:09	SERVICE ALERT: den-ltr-hrip-c5dbe61954e6;CPU Usage;WARNING;SOFT;1;2 CPU, average load 96.0% > 95% : WARNING
Service Warning	11/27/2016 22:09	SERVICE ALERT: den-ltr-iwu-388ae4da0b90;CPU Usage;WARNING;SOFT;1;2 CPU, average load 96.0% > 95% : WARNING
Information	11/27/2016 22:08	wproc: Core Worker 20376: job 464451 (pid=15837): Dormant child reaped
Service Critical	11/27/2016 22:08	SERVICE ALERT: den-ltr-mrmc-2707288fd0bf;sshd;CRITICAL;SOFT;1;(Service check timed out after 60.01 seconds)
Runtime Warning	11/27/2016 22:08	Warning: Check of service 'sshd' on host 'den-ltr-mrmc-2707288fd0bf' timed out after 60.005s!
Runtime Error	11/27/2016 22:08	wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Runtime Error	11/27/2016 22:08	wproc:   host=den-ltr-mrmc-2707288fd0bf; service=sshd;
Runtime Error	11/27/2016 22:08	wproc: CHECK job 464451 from worker Core Worker 20376 timed out after 60.01s
Information	11/27/2016 22:08	wproc: Core Worker 20376: job 464451 (pid=15837) timed out. Killing it
Service Warning	11/27/2016 22:08	SERVICE ALERT: esxi106.unit.den3.loc;esx_CPU;WARNING;SOFT;1;32 CPU, average load 42.7% > 40% : WARNING
Service Warning	11/27/2016 22:08	SERVICE ALERT: den-ltr-lrmhmr-649241bbb91f;CPU Usage;WARNING;SOFT;1;2 CPU, average load 98.0% > 95% : WARNING
Process Information	11/27/2016 22:08	Auto-save of retention data completed successfully.
Information	11/27/2016 22:08	wproc: Core Worker 20367: job 464375 (pid=14770): Dormant child reaped
Runtime Warning	11/27/2016 22:08	Warning: Check of service 'xinetd' on host 'den-ltr91' timed out after 60.006s!
Runtime Error	11/27/2016 22:08	wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Runtime Error	11/27/2016 22:08	wproc:   host=den-ltr91; service=xinetd;
Runtime Error	11/27/2016 22:08	wproc: CHECK job 464375 from worker Core Worker 20367 timed out after 60.01s
Information	11/27/2016 22:08	wproc: Core Worker 20367: job 464375 (pid=14770) timed out. Killing it
Information	11/27/2016 22:08	wproc: Core Worker 20368: job 464375 (pid=14769): Dormant child reaped
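One way to see which checks are timing out most often is to tally the "Runtime Warning" lines by service. The following is a minimal sketch, assuming the event log has been exported to tab-separated text lines like the ones above; the function name `count_service_timeouts` and the `sample` list are illustrative, not part of Nagios XI.

```python
import re
from collections import Counter

# Matches the "Check of service ... timed out" runtime warnings shown above.
TIMEOUT_RE = re.compile(
    r"Check of service '(?P<service>[^']+)' on host '(?P<host>[^']+)' "
    r"timed out after (?P<seconds>[\d.]+)s"
)


def count_service_timeouts(lines):
    """Return a Counter mapping service name to number of timed-out checks."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("service")] += 1
    return counts


# Hypothetical input: three lines copied from the event log excerpt above.
sample = [
    "Runtime Warning\t11/27/2016 22:08\tWarning: Check of service 'sshd' "
    "on host 'den-ltr-mrmc-2707288fd0bf' timed out after 60.005s!",
    "Runtime Warning\t11/27/2016 22:08\tWarning: Check of service 'xinetd' "
    "on host 'den-ltr91' timed out after 60.006s!",
    "Process Information\t11/27/2016 22:08\tAuto-save of retention data "
    "completed successfully.",
]

for service, count in count_service_timeouts(sample).most_common():
    print(service, count)
```

Grouping by `host` instead of `service` would show whether the timeouts cluster on a few monitored machines or are spread across all of them.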
This is what top looks like during high load. Almost all of our checks are SNMP-based, so I'd expect to see that in there a lot.

Code: Select all

[root@den-nagios ~]# top
top - 14:35:46 up  6:44,  1 user,  load average: 95.84, 61.28, 46.69
Tasks: 554 total, 296 running, 258 sleeping,   0 stopped,   0 zombie
%Cpu(s): 60.6 us,  6.8 sy,  0.0 ni, 32.5 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32947500 total, 28688868 free,  2793988 used,  1464644 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 29798440 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7427 root      20   0       0      0      0 R  42.9  0.0  48:27.38 kworker/u16:0
 1915 nagios    20   0   61408  41832   1328 R  41.9  0.1  20:38.50 nagios
 1748 mysql     20   0 4719360 196528   9764 S   6.3  0.6  28:34.75 mysqld
 1943 nagios    20   0  136088   6580   1276 S   5.6  0.0   9:42.63 ndo2db
  637 nagios    20   0  434596  23416   9628 R   3.7  0.1   0:00.22 php
 2081 nagios    20   0  159140  11660   2396 R   2.3  0.0   0:00.07 check_snmp_proc
 2105 nagios    20   0  159272  11712   2452 R   2.3  0.0   0:00.07 check_snmp_proc
 2106 nagios    20   0  159272  11716   2452 R   2.3  0.0   0:00.07 check_snmp_proc
 2117 nagios    20   0  159272  11716   2452 R   2.3  0.0   0:00.07 check_snmp_proc
 2130 nagios    20   0  159272  11716   2452 R   2.3  0.0   0:00.07 check_snmp_proc
 2138 nagios    20   0  159272  11712   2452 R   2.3  0.0   0:00.07 check_snmp_proc
 2144 nagios    20   0  159272  11712   2452 R   2.3  0.0   0:00.07 check_snmp_proc
 2146 nagios    20   0  159272  11712   2452 R   2.3  0.0   0:00.07 check_snmp_proc
 2147 nagios    20   0  159272  11716   2452 R   2.3  0.0   0:00.07 check_snmp_proc
 2189 nagios    20   0  159140  11716   2452 R   2.3  0.0   0:00.07 check_snmp_proc
 2077 nagios    20   0  159140  11656   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2086 nagios    20   0  159140  11664   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2087 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2098 nagios    20   0  159140  11660   2396 S   2.0  0.0   0:00.06 check_snmp_proc
 2100 nagios    20   0  140656   9208   2116 D   2.0  0.0   0:00.06 check_snmp_proc
 2101 nagios    20   0  154548  10920   2200 R   2.0  0.0   0:00.06 check_snmp_proc
 2109 nagios    20   0  159140  11656   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2115 nagios    20   0  159140  11664   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2118 nagios    20   0  154548  10916   2200 R   2.0  0.0   0:00.06 check_snmp_proc
 2126 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2132 nagios    20   0  159140  11656   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2139 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2142 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2148 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2153 nagios    20   0  159140  11656   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2154 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2168 nagios    20   0  159272  11716   2452 R   2.0  0.0   0:00.06 check_snmp_proc
 2170 nagios    20   0  159140  11712   2452 R   2.0  0.0   0:00.06 check_snmp_proc
 2172 nagios    20   0  154548  10920   2200 R   2.0  0.0   0:00.06 check_snmp_proc
 2178 nagios    20   0  154548  10924   2200 R   2.0  0.0   0:00.06 check_snmp_proc
 2180 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2192 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2195 nagios    20   0  159272  11716   2452 R   2.0  0.0   0:00.06 check_snmp_proc
 2200 nagios    20   0  159140  11656   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2201 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2207 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2210 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2222 nagios    20   0  159140  11656   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2237 nagios    20   0  159140  11660   2396 R   2.0  0.0   0:00.06 check_snmp_proc
 2238 nagios    20   0  159140  11712   2452 R   2.0  0.0   0:00.06 check_snmp_proc
 2239 nagios    20   0  154548  10920   2200 R   2.0  0.0   0:00.06 check_snmp_proc
 2241 nagios    20   0  154944  11544   2304 R   2.0  0.0   0:00.06 check_snmp_proc
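To put the load average in context, it helps to divide it by the number of CPUs (8 on these hosts, per the first post). On Linux the load average counts tasks that are runnable or in uninterruptible sleep, so a per-CPU value well above 1 suggests more work is queued than the cores can run. The following is a rough sketch that parses the `top` header line; `load_per_cpu` is a hypothetical helper name.

```python
import re

# Captures the 1-, 5-, and 15-minute load averages from a top/uptime header.
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def load_per_cpu(top_header, cpu_count):
    """Return the (1m, 5m, 15m) load averages divided by the CPU count."""
    match = LOAD_RE.search(top_header)
    if match is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) / cpu_count for value in match.groups())


# Header line copied from the top output above; 8 CPUs per the first post.
header = "top - 14:35:46 up  6:44,  1 user,  load average: 95.84, 61.28, 46.69"
one, five, fifteen = load_per_cpu(header, cpu_count=8)
print(f"per-CPU load: {one:.2f} (1m), {five:.2f} (5m), {fifteen:.2f} (15m)")
```

For this snapshot the 1-minute figure works out to roughly 12 runnable tasks per CPU, which is consistent with the 296 running tasks reported on the `Tasks:` line.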
And here is the ps output:

Code: Select all

[root@den-nagios ~]# ps -aef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 07:51 ?        00:00:13 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
root         2     0  0 07:51 ?        00:00:00 [kthreadd]
root         3     2  0 07:51 ?        00:01:52 [ksoftirqd/0]
root         5     2  0 07:51 ?        00:00:00 [kworker/0:0H]
root         7     2  0 07:51 ?        00:02:32 [migration/0]
root         8     2  0 07:51 ?        00:00:00 [rcu_bh]
root         9     2  0 07:51 ?        00:00:00 [rcuob/0]
root        10     2  0 07:51 ?        00:00:00 [rcuob/1]
root        11     2  0 07:51 ?        00:00:00 [rcuob/2]
root        12     2  0 07:51 ?        00:00:00 [rcuob/3]
root        13     2  0 07:51 ?        00:00:00 [rcuob/4]
root        14     2  0 07:51 ?        00:00:00 [rcuob/5]
root        15     2  0 07:51 ?        00:00:00 [rcuob/6]
root        16     2  0 07:51 ?        00:00:00 [rcuob/7]
root        17     2  0 07:51 ?        00:00:51 [rcu_sched]
root        18     2  0 07:51 ?        00:00:10 [rcuos/0]
root        19     2  0 07:51 ?        00:00:09 [rcuos/1]
root        20     2  0 07:51 ?        00:00:09 [rcuos/2]
root        21     2  0 07:51 ?        00:00:09 [rcuos/3]
root        22     2  0 07:51 ?        00:00:09 [rcuos/4]
root        23     2  0 07:51 ?        00:00:10 [rcuos/5]
root        24     2  0 07:51 ?        00:00:10 [rcuos/6]
root        25     2  0 07:51 ?        00:00:10 [rcuos/7]
root        26     2  0 07:51 ?        00:00:35 [watchdog/0]
root        27     2  0 07:51 ?        00:00:29 [watchdog/1]
root        28     2  0 07:51 ?        00:03:12 [migration/1]
root        29     2  0 07:51 ?        00:02:02 [ksoftirqd/1]
root        31     2  0 07:51 ?        00:00:00 [kworker/1:0H]
root        32     2  0 07:51 ?        00:00:33 [watchdog/2]
root        33     2  0 07:51 ?        00:02:37 [migration/2]
root        34     2  0 07:51 ?        00:01:35 [ksoftirqd/2]
root        37     2  0 07:51 ?        00:00:33 [watchdog/3]
root        38     2  0 07:51 ?        00:02:44 [migration/3]
root        39     2  0 07:51 ?        00:01:39 [ksoftirqd/3]
root        41     2  0 07:51 ?        00:00:00 [kworker/3:0H]
root        42     2  0 07:51 ?        00:00:32 [watchdog/4]
root        43     2  0 07:51 ?        00:02:32 [migration/4]
root        44     2  0 07:51 ?        00:01:33 [ksoftirqd/4]
root        46     2  0 07:51 ?        00:00:00 [kworker/4:0H]
root        47     2  0 07:51 ?        00:00:34 [watchdog/5]
root        48     2  0 07:51 ?        00:02:37 [migration/5]
root        49     2  0 07:51 ?        00:01:54 [ksoftirqd/5]
root        51     2  0 07:51 ?        00:00:00 [kworker/5:0H]
root        52     2  0 07:51 ?        00:00:34 [watchdog/6]
root        53     2  0 07:51 ?        00:02:31 [migration/6]
root        54     2  0 07:51 ?        00:01:39 [ksoftirqd/6]
root        57     2  0 07:51 ?        00:00:33 [watchdog/7]
root        58     2  0 07:51 ?        00:02:34 [migration/7]
root        59     2  0 07:51 ?        00:01:36 [ksoftirqd/7]
root        61     2  0 07:51 ?        00:00:00 [kworker/7:0H]
root        62     2  0 07:51 ?        00:00:00 [khelper]
root        63     2  0 07:51 ?        00:00:00 [kdevtmpfs]
root        64     2  0 07:51 ?        00:00:00 [netns]
root        65     2  0 07:51 ?        00:00:00 [perf]
root        66     2  0 07:51 ?        00:00:00 [writeback]
root        67     2  0 07:51 ?        00:00:00 [kintegrityd]
root        68     2  0 07:51 ?        00:00:00 [bioset]
root        69     2  0 07:51 ?        00:00:00 [kblockd]
root        70     2  0 07:51 ?        00:00:00 [md]
root        75     2  0 07:51 ?        00:00:00 [khungtaskd]
root        76     2  0 07:51 ?        00:00:00 [kswapd0]
root        77     2  0 07:51 ?        00:00:00 [ksmd]
root        78     2  0 07:51 ?        00:00:15 [khugepaged]
root        79     2  0 07:51 ?        00:00:00 [fsnotify_mark]
root        80     2  0 07:51 ?        00:00:00 [crypto]
root        88     2  0 07:51 ?        00:00:00 [kthrotld]
root        90     2  0 07:51 ?        00:00:00 [kmpath_rdacd]
root        91     2  0 07:51 ?        00:00:00 [kpsmoused]
root        93     2  0 07:51 ?        00:00:00 [ipv6_addrconf]
root       112     2  0 07:51 ?        00:00:00 [deferwq]
root       147     2  0 07:51 ?        00:00:00 [kauditd]
root       319     2  0 07:51 ?        00:00:00 [scsi_eh_0]
root       320     2  0 07:51 ?        00:00:00 [ata_sff]
root       321     2  0 07:51 ?        00:00:00 [scsi_tmf_0]
root       322     2  0 07:51 ?        00:00:00 [vmw_pvscsi_wq_0]
root       333     2  0 07:51 ?        00:00:00 [scsi_eh_1]
root       335     2  0 07:51 ?        00:00:00 [events_power_ef]
root       336     2  0 07:51 ?        00:00:00 [scsi_tmf_1]
root       338     2  0 07:51 ?        00:00:00 [scsi_eh_2]
root       339     2  0 07:51 ?        00:00:00 [scsi_tmf_2]
root       344     2  0 07:51 ?        00:00:00 [ttm_swap]
root       377     2  0 07:51 ?        00:00:10 [kworker/5:1H]
root       438     2  0 07:51 ?        00:00:00 [kdmflush]
root       439     2  0 07:51 ?        00:00:00 [bioset]
root       450     2  0 07:51 ?        00:00:00 [kdmflush]
root       451     2  0 07:51 ?        00:00:00 [bioset]
root       464     2  0 07:51 ?        00:00:00 [xfsalloc]
root       465     2  0 07:51 ?        00:00:00 [xfs_mru_cache]
root       466     2  0 07:51 ?        00:00:00 [xfs-buf/dm-0]
root       467     2  0 07:51 ?        00:00:00 [xfs-data/dm-0]
root       468     2  0 07:51 ?        00:00:00 [xfs-conv/dm-0]
root       469     2  0 07:51 ?        00:00:00 [xfs-cil/dm-0]
root       470     2  4 07:51 ?        00:17:50 [xfsaild/dm-0]
root       471     2  0 07:51 ?        00:00:06 [kworker/3:1H]
root       546     1  0 07:51 ?        00:00:07 /usr/lib/systemd/systemd-journald
root       562     1  0 07:51 ?        00:00:00 /usr/sbin/lvmetad -f
root       570     1  0 07:51 ?        00:00:00 /usr/lib/systemd/systemd-udevd
root       624     2  0 07:51 ?        00:00:11 [kworker/0:1H]
root       668     2  0 07:51 ?        00:00:00 [xfs-buf/sda2]
root       669     2  0 07:51 ?        00:00:00 [xfs-data/sda2]
root       670     2  0 07:51 ?        00:00:00 [xfs-conv/sda2]
root       671     2  0 07:51 ?        00:00:00 [xfs-cil/sda2]
root       672     2  0 07:51 ?        00:00:00 [xfsaild/sda2]
root       673     2  0 07:51 ?        00:00:00 [kdmflush]
root       674     2  0 07:51 ?        00:00:00 [bioset]
root       685     2  0 07:51 ?        00:00:00 [xfs-buf/dm-2]
root       686     2  0 07:51 ?        00:00:00 [xfs-data/dm-2]
root       687     2  0 07:51 ?        00:00:00 [xfs-conv/dm-2]
root       688     2  0 07:51 ?        00:00:00 [xfs-cil/dm-2]
root       689     2  0 07:51 ?        00:00:11 [xfsaild/dm-2]
root       700     1  0 07:51 ?        00:01:03 /sbin/auditd -n
root       724     1  0 07:51 ?        00:00:07 /usr/lib/systemd/systemd-logind
root       728     1  0 07:51 ?        00:00:03 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
root       729     1  0 07:51 ?        00:00:03 /usr/sbin/irqbalance --foreground
root       731     1  0 07:51 ?        00:00:22 /usr/bin/vmtoolsd
dbus       732     1  0 07:51 ?        00:00:17 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-a
chrony     734     1  0 07:51 ?        00:00:00 /usr/sbin/chronyd
root       739     1  0 07:51 ?        00:00:01 /usr/sbin/rsyslogd -n
root       745     1  0 07:51 ?        00:00:01 /usr/sbin/crond -n
root       751     1  0 07:51 tty1     00:00:00 /sbin/agetty --noclear tty1 linux
root       768     2  0 07:51 ?        00:00:05 [kworker/7:1H]
root       771     2  0 14:35 ?        00:00:00 [kworker/4:2]
root       811     1  0 07:51 ?        00:00:06 /usr/sbin/NetworkManager --no-daemon
root      1138     1  0 07:51 ?        00:00:00 /usr/sbin/wpa_supplicant -u -f /var/log/wpa_supplicant.log -c /etc/wpa_suppli
polkitd   1139     1  0 07:51 ?        00:00:04 /usr/lib/polkit-1/polkitd --no-debug
root      1418     1  0 07:51 ?        00:00:00 /usr/sbin/sshd -D
root      1425     1  0 07:51 ?        00:00:17 /usr/sbin/httpd -DFOREGROUND
root      1426     1  0 07:51 ?        00:00:05 /usr/bin/python -Es /usr/sbin/tuned -l -P
root      1429     1  0 07:51 ?        00:00:00 /usr/sbin/xinetd -stayalive -pidfile /var/run/xinetd.pid
nagios    1431     1  0 07:51 ?        00:00:03 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
mysql     1474     1  0 07:51 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
nagios    1494     1  0 07:51 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
mysql     1748  1474  7 07:51 ?        00:28:37 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr
ajaxterm  1797     1  0 07:51 ?        00:00:10 python /usr/share/ajaxterm/ajaxterm.py --daemon --port=8022 --uid=ajaxterm
root      1800     1  0 07:51 ?        00:00:00 /usr/libexec/postfix/master -w
postfix   1803  1800  0 07:51 ?        00:00:00 qmgr -l -t unix -u
root      1912     2  0 14:19 ?        00:00:00 [kworker/7:1]
nagios    1915     1  5 07:52 ?        00:20:40 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    1917  1915  0 07:52 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1919  1915  0 07:52 ?        00:00:20 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1920  1915  0 07:52 ?        00:00:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1921  1915  0 07:52 ?        00:00:19 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1922  1915  0 07:52 ?        00:00:19 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1923  1915  0 07:52 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1924  1915  0 07:52 ?        00:00:16 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1925  1915  0 07:52 ?        00:00:20 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1926  1915  0 07:52 ?        00:00:18 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1927  1915  0 07:52 ?        00:00:19 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1928  1915  0 07:52 ?        00:00:19 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1929  1915  0 07:52 ?        00:00:21 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1942  1494  0 07:52 ?        00:00:17 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios    1943  1942  2 07:52 ?        00:09:43 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
root      2004  1418  0 07:52 ?        00:00:08 sshd: root@pts/0
apache    2017  1425  0 14:35 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
nagios    2023  1915  0 07:52 ?        00:00:03 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root      2116     2  0 14:20 ?        00:00:00 [kworker/3:0]
root      2177     2  0 07:52 ?        00:00:08 [kworker/4:1H]
root      2182     2  0 07:52 ?        00:00:06 [kworker/1:1H]
root      2184  2004  0 07:52 pts/0    00:00:00 -bash
root      2841   745  0 14:36 ?        00:00:00 /usr/sbin/CROND -n
root      2842   745  0 14:36 ?        00:00:00 /usr/sbin/CROND -n
root      2843   745  0 14:36 ?        00:00:00 /usr/sbin/CROND -n
root      2844   745  0 14:36 ?        00:00:00 /usr/sbin/CROND -n
root      2845   745  0 14:36 ?        00:00:00 /usr/sbin/CROND -n
root      2846   745  0 14:36 ?        00:00:00 /usr/sbin/CROND -n
nagios    2847  2841  0 14:36 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/l
nagios    2850  2843  0 14:36 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php > /usr/
nagios    2853  2850  0 14:36 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php
nagios    2856  2845  0 14:36 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/loca
nagios    2857  2846  0 14:36 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/
nagios    2859  2857  4 14:36 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios    2860  2844  0 14:36 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local
nagios    2861  2860  0 14:36 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
nagios    2862  2842  0 14:36 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/local
nagios    2865  2862  0 14:36 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
nagios    2867  2847  0 14:36 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
nagios    2869  2856  0 14:36 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
apache    3266  1425  0 14:36 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache    3267  1425  0 14:36 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
nagios    3388  1926 11 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_uptime.pl --perfparse -c --h
nagios    3469  1920 12 14:36 ?        00:00:00 [check_snmp_proc] <defunct>
nagios    3472  1920  9 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3517  1920  4 14:36 ?        00:00:00 [check_snmp_proc] <defunct>
nagios    3521  1926  4 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3522  1923  4 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3523  1921 17 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3524  1921  1 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3526  1928  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3527  1928 10 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3529  1921  4 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3530  1928  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_uptime.pl --perfparse -c --h
nagios    3533  1921  4 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3534  1928  5 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3535  1928  6 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3537  1921  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3542  1925  5 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3543  1925  1 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_load_wizard.pl -H den-l
nagios    3544  1925  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3545  1925  5 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3550  1924  5 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3551  1924  1 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3558  1927  5 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3559  1927  4 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3560  1927  1 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_uptime.pl --perfparse -c --h
nagios    3561  1927  1 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3563  1927  4 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3564  1927  1 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_load_wizard.pl -H den-l
nagios    3571  1923  1 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3577  1926  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3578  1926  5 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3580  1926  1 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3587  1929  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3588  1929  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3589  1929  7 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3590  1929  6 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3592  1929  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3593  1929  8 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3596  1922  7 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3597  1922  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3598  1922  7 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3599  1922  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3600  1922  7 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3601  1922  7 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3602  1922  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3608  1920  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3609  1920  1 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3610  1920  4 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3611  1920  6 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3612  1920  7 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3613  1920  6 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3614  1920  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3615  1920  0 14:36 ?        00:00:00 [check_icmp] <defunct>
nagios    3617  1920  0 14:36 ?        00:00:00 [check_icmp] <defunct>
nagios    3618  1919  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3619  1919  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3620  1919  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_load_wizard.pl -H den-l
nagios    3621  1919  9 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3622  1919  7 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3623  1919  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3624  1919  6 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3628  1917  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3629  1917  2 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3630  1917  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3631  1917  4 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3633  1917  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3634  1917  3 14:36 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H de
nagios    3656  1920  0 14:36 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    3661  1926  0 14:36 ?        00:00:00 /usr/local/nagios/libexec/check_icmp -H den-ltr-nukusa-4d57815a27a3.unitrends
root      3662  2184  0 14:36 pts/0    00:00:00 ps -aef
nagios    3663  1917  0 14:36 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
root      3760     2  0 14:21 ?        00:00:01 [kworker/2:2]
root      3762     2  0 14:21 ?        00:00:01 [kworker/6:0H]
root      7427     2 34 12:14 ?        00:48:46 [kworker/u16:0]
root      8035     2  0 14:06 ?        00:00:10 [kworker/1:0]
root     10930     2  0 13:53 ?        00:00:07 [kworker/4:0]
root     11174     2  0 13:53 ?        00:00:11 [kworker/5:2]
root     11509     2  0 09:01 ?        00:00:49 [kworker/2:2H]
apache   11613  1425  2 14:24 ?        00:00:19 /usr/sbin/httpd -DFOREGROUND
root     12507     2  0 14:25 ?        00:00:00 [kworker/0:3]
root     12513     2  0 14:25 ?        00:00:00 [kworker/1:1]
root     14124     2  0 12:02 ?        00:00:00 [kworker/2:0H]
root     15140     2  0 14:26 ?        00:00:00 [kworker/6:0]
root     15448     2  0 14:26 ?        00:00:00 [kworker/2:0]
root     17560     2  0 14:27 ?        00:00:01 [kworker/6:2H]
root     19099     2  0 14:28 ?        00:00:00 [kworker/5:0]
root     19162     2  0 14:28 ?        00:00:03 [kworker/7:2]
root     21099     2  0 13:10 ?        00:00:00 [kworker/u16:1]
root     21195     2  0 14:29 ?        00:00:00 [kworker/4:1]
root     23966     2  0 14:14 ?        00:00:07 [kworker/6:1]
root     24874     2  0 14:31 ?        00:00:02 [kworker/0:0]
root     25723     2  0 14:15 ?        00:00:00 [kworker/u16:3]
apache   26028  1425  1 14:31 ?        00:00:04 /usr/sbin/httpd -DFOREGROUND
root     26442     2  0 14:00 ?        00:00:10 [kworker/3:3]
root     26706     2  0 14:32 ?        00:00:00 [kworker/6:2]
root     26728     2  0 14:32 ?        00:00:00 [kworker/1:2]
root     26756     2  0 14:32 ?        00:00:00 [kworker/3:1]
postfix  27008  1800  0 14:32 ?        00:00:00 pickup -l -t unix -u
apache   27821  1425  1 14:32 ?        00:00:02 /usr/sbin/httpd -DFOREGROUND
root     27978     2  0 14:32 ?        00:00:00 [kworker/6:1H]
root     28037     2  0 14:16 ?        00:00:06 [kworker/2:1]
root     28172     2  0 14:32 ?        00:00:00 [kworker/2:3]
root     28369     2  0 14:33 ?        00:00:00 [kworker/5:1]
apache   29943  1425  2 14:33 ?        00:00:03 /usr/sbin/httpd -DFOREGROUND
apache   29944  1425  2 14:33 ?        00:00:04 /usr/sbin/httpd -DFOREGROUND
apache   30181  1425  1 14:33 ?        00:00:01 /usr/sbin/httpd -DFOREGROUND
root     30858     2  0 14:34 ?        00:00:00 [kworker/7:0]
root     31124     2  0 14:34 ?        00:00:00 [kworker/u16:2]
apache   31554  1425  1 14:34 ?        00:00:01 /usr/sbin/httpd -DFOREGROUND
apache   32532  1425  3 14:34 ?        00:00:03 /usr/sbin/httpd -DFOREGROUND
[root@den-nagios ~]#
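For anyone reading a long `ps -aef` listing like the one above, it can help to rank processes by the `C` column (processor utilization, reported by `ps` as an integer percentage) instead of scanning by eye. Below is a minimal Python sketch, assuming the standard eight-column `ps -ef` layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD). The function name `busiest` is hypothetical, and the sample rows are copied from the listing above.

```python
def busiest(ps_output, top_n=3):
    """Return the top_n rows of `ps -ef` style output, ranked by the C column.

    Each returned row is (c_percent, cpu_time, pid, command).
    """
    rows = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the command keeps its arguments.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # blank lines, shell prompts, truncated rows
        uid, pid, ppid, c, stime, tty, cpu_time, cmd = fields
        if not (pid.isdigit() and c.isdigit()):
            continue  # header row or anything that is not a process line
        rows.append((int(c), cpu_time, int(pid), cmd))
    rows.sort(reverse=True)
    return rows[:top_n]


SAMPLE = """\
root      7427     2 34 12:14 ?        00:48:46 [kworker/u16:0]
root      8035     2  0 14:06 ?        00:00:10 [kworker/1:0]
apache   11613  1425  2 14:24 ?        00:00:19 /usr/sbin/httpd -DFOREGROUND
apache   32532  1425  3 14:34 ?        00:00:03 /usr/sbin/httpd -DFOREGROUND
[root@den-nagios ~]#
"""

if __name__ == "__main__":
    for c, cpu_time, pid, cmd in busiest(SAMPLE):
        print(f"{c:3d}%  {cpu_time}  pid={pid}  {cmd}")
```

On these sample rows, the kernel worker thread `[kworker/u16:0]` ranks first, well ahead of the httpd workers.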
I cloned a new VM from a snapshot I took of the Nagios host before I upgraded Nagios XI and updated the OS. Its load average is around 3 so far, but it hasn't been running for very long yet.

These are the packages that would be updated if I let yum run. There are a couple of PHP updates, so it's possible one of them is part of the problem:

Code: Select all

Resolving Dependencies
--> Running transaction check
---> Package epel-release.noarch 0:7-2 will be updated
---> Package epel-release.noarch 0:7-8 will be an update
---> Package php-mcrypt.x86_64 0:5.4.16-5.el7 will be updated
---> Package php-mcrypt.x86_64 0:5.4.16-7.el7 will be an update
---> Package php-mssql.x86_64 0:5.4.16-5.el7 will be updated
---> Package php-mssql.x86_64 0:5.4.16-7.el7 will be an update
---> Package python-simplejson.x86_64 0:3.3.3-1.el7 will be updated
---> Package python-simplejson.x86_64 0:3.5.3-1.el7 will be an update
--> Finished Dependency Resolution
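
When comparing two hosts that differ only by pending updates, the transaction check above can be reduced to installed/candidate version pairs per package. Here is a minimal Python sketch, assuming yum prints the paired "will be updated" / "will be an update" lines shown above. The function name `pending_updates` is hypothetical, and other yum versions or locales may format these lines differently.

```python
import re

# Matches yum transaction-check lines such as:
#   ---> Package php-mcrypt.x86_64 0:5.4.16-5.el7 will be updated
LINE_RE = re.compile(
    r"--->\s+Package\s+(?P<name>\S+)\s+(?P<evr>\S+)\s+"
    r"will be (?P<role>updated|an update)"
)


def pending_updates(yum_output):
    """Return {package: (installed_version, candidate_version)}."""
    installed, candidate = {}, {}
    for line in yum_output.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # dependency-resolution chatter, blank lines, etc.
        # "will be updated" names the installed build,
        # "will be an update" names the build that would replace it.
        target = installed if m.group("role") == "updated" else candidate
        target[m.group("name")] = m.group("evr")
    return {
        name: (installed[name], candidate[name])
        for name in installed
        if name in candidate
    }


SAMPLE = """\
---> Package epel-release.noarch 0:7-2 will be updated
---> Package epel-release.noarch 0:7-8 will be an update
---> Package php-mcrypt.x86_64 0:5.4.16-5.el7 will be updated
---> Package php-mcrypt.x86_64 0:5.4.16-7.el7 will be an update
"""

if __name__ == "__main__":
    for name, (old, new) in sorted(pending_updates(SAMPLE).items()):
        print(f"{name}: {old} -> {new}")
```

The resulting dictionary gives a short candidate list to test one package at a time.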

Re: High load average with 5.3.3?

Posted: Thu Dec 01, 2016 11:50 am
by dwhitfield
Can you PM me or another tech your profile? Admin > System Config > System Profile

After you PM the profile, please make sure you update this thread. That's the only way it will show up on our dashboard. Thanks!

UPDATE: profile received and shared with techs

Re: High load average with 5.3.3?

Posted: Thu Dec 01, 2016 5:11 pm
by cbeattie-unitrends
I've PMed the profile, thanks. For comparison, I've also attached current load average graphs from the Nagios host and the clone I made of its snapshot from before I installed 5.3.3. Almost 10x! :o

I figure I'll install the OS update packages one at a time and see if any of them cause the load average to go haywire.

Re: High load average with 5.3.3?

Posted: Thu Dec 01, 2016 5:19 pm
by avandemore
Did you have base commands or wizard-generated commands in the pre-5.3.3 XI installs? Your info shows a lot of failures. Failed checks are expensive in much the same way an exception is expensive in programming. Such failures often occur after an upgrade when the user has customized commands or wizards, because the upgrade reverts those items to baseline. There is a warning about this before the upgrade runs.

Re: High load average with 5.3.3?

Posted: Fri Dec 02, 2016 9:22 am
by cbeattie-unitrends
I don't think I understand your question correctly. Some of the service checks were initially created by running the autodiscover wizard, but I reconfigured them to be assigned by hostgroup membership instead. As I've added more service checks, I've either written them from scratch starting by adding a new command (check_snmp_uptime for example) or I've copied an existing service check (which may have used a check_xi command) and modified the copy. I did add a "--timeout=60" parameter to the existing check_xi_service_snmp_linux_process command, but that persisted through the 5.3.3 upgrade.

Can you point me at the failures you found? All the service and command objects should be the same between my 5.3.3 host and its clone still running 5.3.2, with the only other difference between the two being the OS updates listed above.

Re: High load average with 5.3.3?

Posted: Fri Dec 02, 2016 2:53 pm
by avandemore
The ones you posted from the event log:

Code: Select all

Service Critical   11/27/2016 22:08   SERVICE ALERT: den-ltr-mrmc-2707288fd0bf;sshd;CRITICAL;SOFT;1;(Service check timed out after 60.01 seconds)
Runtime Warning   11/27/2016 22:08   Warning: Check of service 'sshd' on host 'den-ltr-mrmc-2707288fd0bf' timed out after 60.005s!
Runtime Error   11/27/2016 22:08   wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Runtime Error   11/27/2016 22:08   wproc:   host=den-ltr-mrmc-2707288fd0bf; service=sshd;
Runtime Error   11/27/2016 22:08   wproc: CHECK job 464451 from worker Core Worker 20376 timed out after 60.01s
Information   11/27/2016 22:08   wproc: Core Worker 20376: job 464451 (pid=15837) timed out. Killing it
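
To see which checks account for most of the timeouts, entries like the ones above can be tallied per host and service. This is a minimal Python sketch, assuming the event log has been exported to plain text and that the "Check of service ... timed out after ..." warning keeps the wording shown above. The function name `count_timeouts` is hypothetical.

```python
import re
from collections import Counter

# Matches the Nagios Core runtime warning quoted in the event log, e.g.:
#   Warning: Check of service 'sshd' on host 'example-host' timed out after 60.005s!
TIMEOUT_RE = re.compile(
    r"Check of service '(?P<service>[^']+)' on host '(?P<host>[^']+)' "
    r"timed out after (?P<seconds>[\d.]+)s"
)


def count_timeouts(lines):
    """Count timed-out service checks, keyed by (host, service)."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[(m.group("host"), m.group("service"))] += 1
    return counts


SAMPLE = [
    "Runtime Warning   11/27/2016 22:08   Warning: Check of service 'sshd' "
    "on host 'den-ltr-mrmc-2707288fd0bf' timed out after 60.005s!",
    "Runtime Error   11/27/2016 22:08   wproc: CHECK job 464451 from worker "
    "Core Worker 20376 timed out after 60.01s",
]

if __name__ == "__main__":
    for (host, service), n in count_timeouts(SAMPLE).most_common():
        print(f"{n:4d}  {host}  {service}")
```

Only the "Check of service" warning is counted, so the wproc lines describing the same job do not inflate the totals.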

Re: High load average with 5.3.3? [SOLVED]

Posted: Mon Dec 05, 2016 11:34 am
by cbeattie-unitrends
The culprit turned out to be php-mcrypt 5.4.16-7.el7 from CentOS 7's epel repository. After I ran 'yum downgrade php-mcrypt' and reverted to 5.4.16-4.el7 from the nagiosxi-deps repository, the CPU load average went back to normal.

After that, I also reverted php-mssql just to be on the safe side and keep its version numbers the same as php-mcrypt's.

Re: High load average with 5.3.3?

Posted: Mon Dec 05, 2016 4:05 pm
by dwhitfield
It looks like you marked your last post as [SOLVED]. Is it okay if we lock this thread? Thanks for choosing the Nagios forums!