Load spike question
Posted: Tue Feb 25, 2014 6:17 pm
Question 2 for the new year!
So we're still working on our Mod Gearman deployment & for the most part things are working quite well.
Overview:
Fresh Nagios XI deployment with Mod Gearman: 1 master node and 3 worker nodes. All VMs on ESXi 5: 4 x CPU, 4 GB RAM & loads of disk, RHEL6 64-bit.
The current issue is that, bang on 21:30 each night, all of our Nagios XI servers endure a load spike for about 5 minutes, with the 1-minute average peaking up to 30.
Disk I/O goes through the roof too - I can't remember the exact numbers, but the I/O wait under CPU stats in the server statistics applet goes red.
Checks to the workers start timing out for the duration, but recover once the load returns to normal.
I've been trying to find the culprit, focusing on MySQL, but haven't been able to identify the rogue element. iotop shows jbd2 peaking at 99% - as far as I can tell that's the ext4 journalling kernel thread, so probably a symptom rather than the cause.
So the real question is: is there any Nagios-specific maintenance that runs right at 21:30?
/var/log/cron shows specifically for 21:30:
Feb 25 21:30:01 ulpnag011 CROND[24646]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/nom.php > /usr/local/nagiosxi/var/nom.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24647]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24648]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/deadpool.php > /usr/local/nagiosxi/var/deadpool.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24649]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/reportengine.php > /usr/local/nagiosxi/var/reportengine.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24653]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Feb 25 21:30:01 ulpnag011 CROND[24656]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/local/nagiosxi/var/feedproc.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24657]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local/nagiosxi/var/eventman.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24655]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24659]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24661]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24660]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24654]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lock/mrtg/mrtg_l --confcache-file /var/lib/mrtg/mrtg.ok)
Running these manually didn't replicate the issue, and in any case they all run repeatedly throughout the day without incident.
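One thing I'm planning to try is capturing per-process disk I/O across the window, to see exactly which process is doing the writes. A rough sketch of the cron entry (pidstat is part of the sysstat package, which looks to be installed already given the sa1 entry in the log above - the output path and filename here are just placeholders):

```
# /etc/cron.d/iospike-capture - hypothetical diagnostic entry
# take 60 five-second samples of per-process disk I/O, starting just before the spike
29 21 * * * root /usr/bin/pidstat -d 5 60 > /var/tmp/pidstat-2130.log 2>&1
```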
I'm currently chasing our ESX & storage teams for some insight as well.
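Since sa1 is already collecting system activity data (it's visible in the cron log above), I can also replay the spike window the next morning with sar. Something like this, assuming the RHEL6 default data directory /var/log/sa and yesterday's day-of-month in the filename:

```shell
# hypothetical example: pick yesterday's daily sysstat file (saDD, DD = day of month)
SAFILE=/var/log/sa/sa25
# CPU usage and iowait for the spike window
sar -u -f "$SAFILE" -s 21:25:00 -e 21:40:00 || echo "no data in $SAFILE"
# per-device I/O rates for the same window
sar -d -f "$SAFILE" -s 21:25:00 -e 21:40:00 || echo "no data in $SAFILE"
```

Granularity will be limited by how often sa1 samples (every 10 minutes by default on RHEL6), but it should at least show whether the iowait is VM-wide or confined to one device.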
Appreciate any advice
regards
Lincoln