Load spike question

lance · Post by **lance** » Tue Feb 25, 2014 6:17 pm

Question 2 for the new year!

So we're still working on our Mod Gearman deployment & for the most part things are working quite well.

Overview:
Fresh Nagios XI deployment with Mod Gearman: 1 master node and 3 worker nodes. All VMs on ESX5i: 4 x cpu, 4 gig mem & loads of disk, RHEL6 64bit.

The current issue is that bang on 21:30 each night, all of our Nagios XI servers endure a load spike for about 5 mins, with the 1 min load average peaking at up to 30.

spike.png

Disk I/O goes through the roof too - can't remember the numbers, but I/O wait under CPU stats in the server statistics applet goes red.

Checks to the workers start timing out for the duration, but recover once the load comes back to normal.

Have been trying to find the culprit & focusing on mysql, but have not been able to identify the rogue element. iotop shows a process, jbd2, that peaks at 99%. As far as I can tell that's to do with filesystem journalling, so it's probably a symptom rather than the cause.
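To put numbers on the I/O wait instead of relying on the applet colour, something like the following could be left running across 21:30. It is only a rough sketch: it assumes Python 3 is available on the box and uses the standard Linux `/proc/stat` field order (user, nice, system, idle, iowait, irq, softirq, steal). The function names are made up for illustration and are not part of Nagios XI.

```python
import time

def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat as ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]

def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two counter samples."""
    # Only the first 8 counters (user..steal) are summed; the guest counters
    # that follow are already included in user/nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait

def read_cpu_counters(path="/proc/stat"):
    """Read the current aggregate CPU counters."""
    with open(path) as f:
        return parse_cpu_line(f.readline())

def monitor(samples, interval=5):
    """Print a timestamp, the 1 min load average and iowait % every `interval` seconds."""
    prev = read_cpu_counters()
    for _ in range(samples):
        time.sleep(interval)
        cur = read_cpu_counters()
        with open("/proc/loadavg") as f:
            load1 = f.read().split()[0]
        print("%s load1=%s iowait=%.1f%%"
              % (time.strftime("%H:%M:%S"), load1, iowait_percent(prev, cur)))
        prev = cur
```

Calling `monitor(240, 5)` shortly before 21:25 would cover roughly 20 minutes around the spike, and the output could be lined up against whatever the ESX and storage teams see on their side.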

So really the question is: is there any Nagios-specific maintenance that runs right at 21:30?

/var/log/cron shows specifically for 21:30:

Feb 25 21:30:01 ulpnag011 CROND[24646]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/nom.php > /usr/local/nagiosxi/var/nom.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24647]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24648]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/deadpool.php > /usr/local/nagiosxi/var/deadpool.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24649]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/reportengine.php > /usr/local/nagiosxi/var/reportengine.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24653]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Feb 25 21:30:01 ulpnag011 CROND[24656]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/local/nagiosxi/var/feedproc.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24657]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local/nagiosxi/var/eventman.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24655]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24659]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24661]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24660]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24654]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lock/mrtg/mrtg_l --confcache-file /var/lib/mrtg/mrtg.ok)

Running these manually didn't replicate the issue, and they all run repeatedly at other times without incident anyway.
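Since most of those jobs fire every minute, one way to narrow things down is to look for cron entries that appear at 21:30 and at no other minute in the log. A minimal sketch, assuming the log lines follow the same `CROND[pid]: (user) CMD (...)` layout as above (the helper names are hypothetical):

```python
import re
from collections import defaultdict

# Matches lines such as:
# Feb 25 21:30:01 ulpnag011 CROND[24653]: (root) CMD (/usr/lib64/sa/sa1 1 1)
CRON_LINE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hhmm>\d{2}:\d{2}):\d{2}\s+"
    r"(?P<host>\S+)\s+CROND\[\d+\]:\s+\((?P<user>[^)]+)\)\s+CMD\s+\((?P<cmd>.*)\)\s*$"
)

def commands_by_minute(lines):
    """Map each (user, command) pair to the set of HH:MM minutes cron started it."""
    seen = defaultdict(set)
    for line in lines:
        m = CRON_LINE.match(line)
        if m:
            seen[(m.group("user"), m.group("cmd"))].add(m.group("hhmm"))
    return seen

def only_at(lines, hhmm):
    """Return the (user, command) pairs started at `hhmm` and at no other minute."""
    return sorted(
        key for key, minutes in commands_by_minute(lines).items()
        if minutes == {hhmm}
    )
```

Feeding it a full day of `/var/log/cron`, for example `only_at(open("/var/log/cron"), "21:30")`, would list anything scheduled only for that minute. An empty result would suggest the trigger is outside cron on this host.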

Currently am chasing our ESX & Storage groups as well for some insight.

Appreciate any advice

regards

Lincoln

slansing · Post by **slansing** » Wed Feb 26, 2014 10:22 am

Is it taking you about that amount of time to get through applying configurations? Is someone pulling a report from over a year ago? What are the hardware specs of the XI server? And is it on a SAN, or NAS?

lance · Post by **lance** » Wed Feb 26, 2014 6:15 pm

Hi,

Nah - we're not doing any config changes or reporting at that time of night (we've got NDOUtils state history maintained for 30 days, the rest under a week), & when we do apply changes it takes about 10-15 sec.

All the machines are VMs, with the virtual disk hosted on a SAN.

hardware Specs:

VMWARE ESX5i:
RHEL6 64bit.
4 x cpu: Intel(R) Xeon(R) CPU E7-2870 @ 2.40GHz
4 gig mem
loads of disk

As a troubleshooting measure we moved one of the VMs off the SAN disk onto the local host disk, but unfortunately we saw the same behaviour:

spike2.png

thanks

Lincoln

lmiltchev · Post by **lmiltchev** » Thu Feb 27, 2014 12:11 pm

Is there a possibility that something else is run around the same time (21:30) - backups, etc.?

lance · Post by **lance** » Fri Feb 28, 2014 6:19 am

Hi,

I'm told that backups do certainly start at that time across the SAN and we're following that up with the relevant support teams. We're keeping an eye on the virtual guest that we moved to the local disk of an ESX host to see if we get the same symptoms off the storage infrastructure.

Just wanted to confirm whether there were any Nagios-specific tasks kicking off at that time, which there don't seem to be.

Will keep you posted!

scottwilkerson · Post by **scottwilkerson** » Fri Feb 28, 2014 8:45 am

There isn't, but what you describe would be expected if there was high IO wait across the SAN at that time, as Nagios XI needs quite a bit of IO to keep running smoothly.
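If it helps when talking to the storage team, write latency as seen from inside the guest can be sampled directly. The sketch below times a small write plus `fsync`, which forces the data (and the filesystem journal commit) out to the underlying disk. It assumes Python 3; the function name and the choice of directory are illustrative only.

```python
import os
import tempfile
import time

def fsync_latency(directory, size=4096):
    """Write `size` bytes to a temp file in `directory`, fsync it, return elapsed seconds."""
    payload = b"\0" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.monotonic()
        os.write(fd, payload)
        os.fsync(fd)  # block until the data has been flushed to the device
        return time.monotonic() - start
    finally:
        # Always clean up the probe file, even if the write fails.
        os.close(fd)
        os.unlink(path)
```

Calling it every few seconds around 21:30 against a directory on the affected volume, on both a SAN-backed guest and the one moved to local disk, would show whether write latency climbs at the same moment on each.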

lance · Post by **lance** » Fri Feb 28, 2014 5:55 pm

That's quite handy info, thanks for that! Something I can take back to the Storage/ESX teams to assist us with.

regards

Lincoln

slansing · Post by **slansing** » Mon Mar 03, 2014 11:03 am

Excellent, let us know if you have further questions.
