Load spike question

Posted: Tue Feb 25, 2014 6:17 pm
by lance
Question 2 for the new year!

So we're still working on our Mod Gearman deployment & for the most part things are working quite well.

Overview:
Fresh Nagios XI deployment with Mod Gearman: 1 master node and 3 worker nodes. All are VMs on ESX5i with 4 x CPU, 4 GB memory and loads of disk, running RHEL6 64-bit.

The current issue is that bang on 21:30 each night, all of our Nagios XI servers endure a load spike for about 5 minutes, with the 1-minute load average peaking at up to 30.
[Attachment: spike.png]
Disk I/O goes through the roof too. I can't remember the numbers, but I/O wait under the CPU stats in the server statistics applet goes red.

Checks to the workers start timing out for the duration, but recover once the load comes back to normal.

I have been trying to find the culprit, focusing on MySQL, but have not been able to identify the rogue element. iotop shows a process, jbd2, that peaks at 99%. As far as I can tell that's to do with filesystem journalling, so it is probably a symptom rather than the cause.

So really the question is: is there any Nagios-specific maintenance that runs right at 21:30?

/var/log/cron shows specifically for 21:30:

Feb 25 21:30:01 ulpnag011 CROND[24646]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/nom.php > /usr/local/nagiosxi/var/nom.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24647]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24648]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/deadpool.php > /usr/local/nagiosxi/var/deadpool.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24649]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/reportengine.php > /usr/local/nagiosxi/var/reportengine.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24653]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Feb 25 21:30:01 ulpnag011 CROND[24656]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/local/nagiosxi/var/feedproc.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24657]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local/nagiosxi/var/eventman.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24655]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24659]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24661]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24660]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24654]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lock/mrtg/mrtg_l --confcache-file /var/lib/mrtg/mrtg.ok)

Running these manually didn't replicate the issue, and they all run repeatedly anyway without incident.
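
One way to check whether any of these cron entries is specific to 21:30 is to compare them against the entries logged at other minutes. Below is a minimal Python sketch, assuming the RHEL-style /var/log/cron line format shown above. The names jobs_by_minute and only_at are hypothetical helpers made up for this illustration, not part of Nagios XI.

```python
import re
from collections import defaultdict

# Matches RHEL-style cron log lines such as:
#   Feb 25 21:30:01 host CROND[24653]: (root) CMD (/usr/lib64/sa/sa1 1 1)
CRON_LINE = re.compile(
    r"^\w{3}\s+\d+\s+(?P<hhmm>\d{2}:\d{2}):\d{2}\s+\S+\s+"
    r"CROND\[\d+\]:\s+\((?P<user>[^)]+)\)\s+CMD\s+\((?P<cmd>.*)\)\s*$"
)


def jobs_by_minute(lines):
    """Map each (user, command) pair to the set of HH:MM slots it started in."""
    seen = defaultdict(set)
    for line in lines:
        m = CRON_LINE.match(line)
        if m:  # silently skip lines that are not CROND CMD entries
            seen[(m["user"], m["cmd"])].add(m["hhmm"])
    return seen


def only_at(lines, hhmm):
    """Return the jobs whose every logged start falls in the given HH:MM slot."""
    return sorted(
        job for job, slots in jobs_by_minute(lines).items() if slots == {hhmm}
    )
```

Calling only_at(open("/var/log/cron"), "21:30") would list the jobs that never start at any other minute. Jobs that also run every minute or every five minutes drop out of the result. An empty list would suggest the spike is not triggered by cron on that host.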

I am currently chasing our ESX and storage groups as well for some insight.

Appreciate any advice

regards

Lincoln

Re: Load spike question

Posted: Wed Feb 26, 2014 10:22 am
by slansing
Is it taking you about that amount of time to get through applying configurations? Is someone pulling a report from over a year ago? What are the hardware specs of the XI server? And is it on a SAN or a NAS?

Re: Load spike question

Posted: Wed Feb 26, 2014 6:15 pm
by lance
Hi,

Nah, we're not doing any config changes or reporting at that time of night (we've got NDOUtils state history retained for 30 days, the rest under a week), and when we do apply config changes it takes about 10-15 seconds.

All the machines are VMs, with the virtual disk hosted on a SAN.

Hardware specs:

VMware ESX5i
RHEL6 64-bit
4 x CPU: Intel(R) Xeon(R) CPU E7- 2870 @ 2.40GHz
4 GB memory
Loads of disk

As a troubleshooting measure we moved one of the VMs off the SAN disk onto the host's local disk, but unfortunately we saw the same behaviour:
[Attachment: spike2.png]

thanks

Lincoln

Re: Load spike question

Posted: Thu Feb 27, 2014 12:11 pm
by lmiltchev
Is it possible that something else runs around the same time (21:30), such as backups?

Re: Load spike question

Posted: Fri Feb 28, 2014 6:19 am
by lance
Hi,

I'm told that backups do certainly start at that time across the SAN, and we're following that up with the relevant support teams. We're keeping an eye on the virtual guest that we moved to the local disk of an ESX host to see if we get the same symptoms away from the storage infrastructure.

Just wanted to confirm whether there were any Nagios-specific tasks kicking off at that time, which there don't seem to be.

Will keep you posted!

Re: Load spike question

Posted: Fri Feb 28, 2014 8:45 am
by scottwilkerson
There isn't, but what you describe would be expected if there was high I/O wait across the SAN at that time, as Nagios XI needs quite a bit of I/O to keep running smoothly.
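
If it helps to put a number on the I/O wait during the spike, it can be derived from two samples of the aggregate "cpu" line in /proc/stat. This is only a rough Python sketch, not anything Nagios XI ships. The function names are made up for illustration, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    # The trailing guest fields are skipped because that time is already
    # included in user/nice.
    return [int(value) for value in parts[1:9]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' line samples."""
    old, new = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [n - o for o, n in zip(old, new)]
    total = sum(deltas)
    if total <= 0:
        return 0.0  # no time elapsed between samples, nothing to report
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter
```

Reading the first line of /proc/stat a few seconds apart around 21:30 and passing both strings to iowait_percent would show how much of the interval the CPUs spent waiting on disk. Logging that alongside the SAN backup schedule should make the correlation easier to show to the storage team.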

Re: Load spike question

Posted: Fri Feb 28, 2014 5:55 pm
by lance
That's quite handy info, thanks for that! It's something I can take back to the Storage/ESX teams to help us along.

regards

Lincoln

Re: Load spike question

Posted: Mon Mar 03, 2014 11:03 am
by slansing
Excellent, let us know if you have further questions.