I brushed up on using strace to better understand what it was telling me. When I started Nagios under strace, I saw a lot of failing write syscalls. Digging deeper, I found they happened whenever XI was attempting to do something with messaging-enabled users; each failing write carries a handle_nagioscore_notification.php command for a contact.
Code:
write(9, "job_id=290\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"jbaucom\" --contactemail=\"[email protected]\" --type=DOWN"..., 826) = -1 EAGAIN (Resource temporarily unavailable)
write(9, "job_id=290\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"jbaucom\" --contactemail=\"[email protected]\" --type=DOWN"..., 826) = -1 EAGAIN (Resource temporarily unavailable)
write(9, "job_id=290\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"jbaucom\" --contactemail=\"[email protected]\" --type=DOWN"..., 826) = -1 EAGAIN (Resource temporarily unavailable)
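To get a feel for how many of these failures there are, and on which file descriptors, a saved strace log can be tallied with a few lines of Python. This is only a rough sketch: the function name and the regex are my own, and it assumes the log uses the line format shown above (optionally with a pid prefix, as strace adds when following child processes).

```python
import re
from collections import Counter

# Matches failed write() lines such as:
#   write(9, "job_id=290..."..., 826) = -1 EAGAIN (Resource temporarily unavailable)
# An optional "[pid 123] " or "123 " prefix is tolerated.
FAILED_WRITE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?write\((\d+),.*\)\s+=\s+-1\s+([A-Z]+)\s'
)

def tally_failed_writes(lines):
    """Count failed write() calls per (file descriptor, errno name)."""
    counts = Counter()
    for line in lines:
        match = FAILED_WRITE.match(line.strip())
        if match:
            fd = int(match.group(1))
            errno_name = match.group(2)
            counts[(fd, errno_name)] += 1
    return counts
```

Passing an open log file (any iterable of lines works) returns a Counter keyed by (fd, errno), so `most_common()` shows which descriptor is failing most often. Successful writes and other syscalls are ignored.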
I tried renaming /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php, but that didn't change anything. So I took a more drastic approach and deleted all the messaging-enabled XI users. Since then, strace hasn't shown any of those write syscall errors, and Nagios seems much happier.
Code:
top - 07:53:18 up 17 min, 2 users, load average: 4.32, 4.21, 3.89
Tasks: 281 total, 9 running, 272 sleeping, 0 stopped, 0 zombie
%Cpu(s): 34.7 us, 6.5 sy, 0.0 ni, 56.5 id, 2.1 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 32931056 total, 29410352 free, 1014112 used, 2506592 buff/cache
KiB Swap: 16515068 total, 16515068 free, 0 used. 31374904 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
112433 nagios 20 0 159404 11964 2496 S 68.8 0.0 0:00.11 check_snmp_proc
1928 apache 20 0 642836 32728 5944 S 62.5 0.1 0:10.82 httpd
112438 nagios 20 0 159276 11740 2436 S 43.8 0.0 0:00.07 check_snmp_stor
[root@den-nagios ~]#
The next step will be to add messaging-enabled contacts back and see what happens, but I'm going to let Nagios catch up on its work before I poke it again.
As far as being overloaded goes, I wanted to know that I had a clean, working configuration to export to multiple boxes before I started cutting checks out.