
Re: High Load Average

Posted: Tue Jun 20, 2017 2:55 pm
by dwhitfield
SteveBeauchemin wrote: Any IO wait states are also bad. Some backup systems steal all your IO cycles.
Not that the rest wasn't the usual brilliance, but I wanted to single this out, because we see backup systems cause issues from time to time.

Depending on your network speeds, you may not want to offload your db. Just a note about Steve's "all the things" comment. :)

Re: High Load Average

Posted: Mon Jun 26, 2017 7:49 am
by cbeattie-unitrends
I spun up two new Nagios servers with RAM disks, one Core and one XI. The Core instance is on a 6 CPU VM running the full set of 30K checks at a 3-4 load average. The XI instance is a VM with 12 CPUs, with the first XI server's config backed up, imported, and half the hosts deleted. It's got about 400 hosts, just over 12K services, and it's still running at a 20 load average.

So whatever problem I have migrated to the new server. It doesn't appear to be strictly tied to the number of host and service checks, and it appears to live in the XI portion rather than in both XI and Core.

Re: High Load Average

Posted: Mon Jun 26, 2017 9:06 am
by dwhitfield
Based on our hardware requirements doc, that 12-CPU VM is a little under-powered: https://assets.nagios.com/downloads/nag ... ements.pdf

3k H+S = 4 CPU
12k H+S ~= 16 CPU
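As a rough illustration of that arithmetic, here is a minimal sketch. It assumes the two figures above scale linearly (both work out to about 750 hosts+services per CPU), which the requirements doc may not actually promise; the function name and constant are made up for this example.

```python
# Hypothetical helper: assumes linear scaling between the two quoted data points
# (3k H+S = 4 CPU, 12k H+S ~= 16 CPU), i.e. roughly 750 H+S per CPU.
CHECKS_PER_CPU = 750

def estimate_cpus(hosts_plus_services: int) -> int:
    """Round up to the next whole CPU for a given hosts+services count."""
    return -(-hosts_plus_services // CHECKS_PER_CPU)

if __name__ == "__main__":
    # About 400 hosts plus just over 12K services, as described in the earlier post.
    print(estimate_cpus(400 + 12000))
```

By that estimate, 12 CPUs would be a few short for the XI instance, though it would not by itself explain a load average of 20.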

Is your database offloaded? Depending on network speeds, offloading could help or hurt, but the database is certainly a piece that XI has and Core does not.

Re: High Load Average

Posted: Thu Jun 29, 2017 8:59 am
by cbeattie-unitrends
I brushed up on using strace to better understand what it was telling me. When I started Nagios with strace watching, I saw that there were a lot of write syscall errors. Digging deeper, I found it was when XI was attempting to do something with messaging-enabled users.

Code:

write(9, "job_id=290\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"jbaucom\" --contactemail=\"[email protected]\" --type=DOWN"..., 826) = -1 EAGAIN (Resource temporarily unavailable)
write(9, "job_id=290\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"jbaucom\" --contactemail=\"[email protected]\" --type=DOWN"..., 826) = -1 EAGAIN (Resource temporarily unavailable)
write(9, "job_id=290\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"jbaucom\" --contactemail=\"[email protected]\" --type=DOWN"..., 826) = -1 EAGAIN (Resource temporarily unavailable)
I tried renaming /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php, but that didn't change anything. So I tried a more drastic approach of deleting all the messaging-enabled XI users. strace isn't showing any of those write syscall errors any more, and Nagios seems much happier.
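For anyone unfamiliar with EAGAIN on write: it is what the kernel returns when a non-blocking descriptor's buffer is full because the other end is not draining it. The sketch below reproduces that with a plain pipe that nobody reads. It is only an illustration of the error itself; whether fd 9 in the trace above is a pipe or a socket, and what sits on the other end, is an assumption not confirmed here.

```python
import errno
import os

def fill_nonblocking_pipe(chunk_size=4096):
    """Write to a non-blocking pipe that nobody reads until the kernel refuses more data.

    Returns (bytes_buffered, errno_value).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # same mode that makes write() return EAGAIN
    written = 0
    err = None
    try:
        while True:
            written += os.write(write_fd, b"x" * chunk_size)
    except BlockingIOError as exc:
        # Python raises BlockingIOError when write() fails with EAGAIN/EWOULDBLOCK
        err = exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written, err

if __name__ == "__main__":
    written, err = fill_nonblocking_pipe()
    print(f"buffered {written} bytes, then errno {errno.errorcode.get(err)}")
```

A writer that keeps retrying the same message against a full buffer, as in the repeated job_id=290 lines, would be consistent with a consumer that has stalled or cannot keep up.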

Code:

top - 07:53:18 up 17 min,  2 users,  load average: 4.32, 4.21, 3.89
Tasks: 281 total,   9 running, 272 sleeping,   0 stopped,   0 zombie
%Cpu(s): 34.7 us,  6.5 sy,  0.0 ni, 56.5 id,  2.1 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 32931056 total, 29410352 free,  1014112 used,  2506592 buff/cache
KiB Swap: 16515068 total, 16515068 free,        0 used. 31374904 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
112433 nagios    20   0  159404  11964   2496 S  68.8  0.0   0:00.11 check_snmp_proc
  1928 apache    20   0  642836  32728   5944 S  62.5  0.1   0:10.82 httpd
112438 nagios    20   0  159276  11740   2436 S  43.8  0.0   0:00.07 check_snmp_stor
[root@den-nagios ~]#
The next step will be to add messaging-enabled contacts back and see what happens, but I'm going to let Nagios catch up on its work before I poke it again.

As far as being overloaded goes, I wanted to know that I had a clean, working configuration to export to multiple boxes before I started cutting checks out.

Re: High Load Average

Posted: Thu Jun 29, 2017 2:14 pm
by tgriep
When Nagios detects that it has to send an email, it writes the information to the MySQL database, and another process reads that info and sends it on. If the message queue is full, or the system could not write to the database, that could be another cause of the issue.

Another thing I found in your profile: the time zone settings do not match, and that should be fixed.

Code:

===Date/Time====

PHP Timezone: US/Central 
PHP Time: Fri, 16 Jun 2017 15:29:05 -0500
System Time: Fri, 16 Jun 2017 14:29:05 -0600
Follow this document to get the PHP time and the System time in sync.
https://assets.nagios.com/downloads/nag ... m_Time.pdf
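To make the mismatch concrete, here is a small sketch that parses the two timestamps from the profile output above and compares them. It only uses the values shown there; nothing about how XI itself reads the time zone is implied.

```python
from email.utils import parsedate_to_datetime

# Timestamps copied from the profile output above.
php_time = parsedate_to_datetime("Fri, 16 Jun 2017 15:29:05 -0500")
system_time = parsedate_to_datetime("Fri, 16 Jun 2017 14:29:05 -0600")

# Aware datetimes compare by absolute instant, not by wall-clock digits.
same_instant = php_time == system_time
# Difference between the two configured UTC offsets.
offset_gap = php_time.utcoffset() - system_time.utcoffset()

print(same_instant, offset_gap)
```

The two clocks agree on the actual moment, but PHP and the OS are configured one hour apart in UTC offset, so any code that compares local wall-clock times between the two would be off by an hour.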