Load spike question

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Load spike question

Post by lance »

Question 2 for the new year!

So we're still working on our Mod Gearman deployment, and for the most part things are working quite well.

Overview:
Fresh Nagios XI deployment with Mod Gearman: 1 master node and 3 worker nodes. All are VMs on ESX5i, each with 4 x CPU, 4 GB memory and loads of disk, running RHEL6 64-bit.

The current issue is that bang on 21:30 each night, all of our Nagios XI servers endure a load spike for about 5 minutes, with the 1-minute load average peaking at up to 30.
spike.png
Disk I/O goes through the roof too. I can't remember the numbers, but I/O wait under CPU stats in the server statistics applet goes red.

Checks to the workers start timing out for the duration, but recover once the load comes back to normal.

I have been trying to find the culprit, focusing on MySQL, but have not been able to identify the rogue element. iotop shows a process, jbd2, that peaks at 99%. As far as I can tell, that's to do with journalling and is probably a symptom rather than the cause.
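A note on the jbd2 observation: jbd2 is the kernel journalling thread used by ext4, and on Linux the load average counts tasks in uninterruptible sleep (state D, usually waiting on disk) as well as runnable ones. A load of 30 on 4 CPUs can therefore come from processes blocked on I/O rather than from CPU work. Below is a minimal sketch for listing such tasks while the spike is happening, assuming Python 3 and a Linux /proc filesystem. The names parse_state and blocked_tasks are illustrative only and are not part of any Nagios tooling.

```python
import os


def parse_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    head, _, tail = stat_text.rpartition(")")
    comm = head.split("(", 1)[1]
    state = tail.split()[0]
    return comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for tasks currently in uninterruptible sleep (D)."""
    blocked = []
    if not os.path.isdir(proc_root):
        return blocked  # not a Linux-style /proc; nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                comm, state = parse_state(fh.read())
        except (OSError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            blocked.append((int(entry), comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Run in a loop around 21:30, this would show whether the blocked tasks are Nagios processes, mysqld, or something else entirely. It only takes a snapshot, so short-lived stalls can be missed.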

So really the question is: is there any Nagios-specific maintenance that runs right at 21:30?

/var/log/cron shows specifically for 21:30:

Feb 25 21:30:01 ulpnag011 CROND[24646]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/nom.php > /usr/local/nagiosxi/var/nom.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24647]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24648]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/deadpool.php > /usr/local/nagiosxi/var/deadpool.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24649]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/reportengine.php > /usr/local/nagiosxi/var/reportengine.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24653]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Feb 25 21:30:01 ulpnag011 CROND[24656]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/local/nagiosxi/var/feedproc.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24657]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local/nagiosxi/var/eventman.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24655]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24659]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24661]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24660]: (nagios) CMD (/usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1)
Feb 25 21:30:01 ulpnag011 CROND[24654]: (root) CMD (LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lock/mrtg/mrtg_l --confcache-file /var/lib/mrtg/mrtg.ok)

Running these manually didn't replicate the issue, and they all run repeatedly anyway without incident.
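One way to separate jobs that run every minute from jobs that only fire at 21:30 is to group the cron log by command and look at which minutes each one appears in. The following is a minimal sketch, assuming Python 3 and log lines in the same CROND format as the excerpt above. The regular expression and the names minutes_by_job and only_at are illustrative assumptions, not an existing tool.

```python
import re
import sys
from collections import defaultdict

# Matches lines such as:
#   Feb 25 21:30:01 host CROND[24646]: (nagios) CMD (/usr/bin/php -q ...)
CRON_LINE = re.compile(
    r"^\w{3}\s+\d+\s+(\d\d:\d\d):\d\d\s+\S+\s+CROND\[\d+\]:"
    r"\s+\((\S+)\)\s+CMD\s+\((.*)\)\s*$"
)


def minutes_by_job(lines):
    """Map (user, command) to the set of HH:MM minutes at which it ran."""
    schedule = defaultdict(set)
    for line in lines:
        match = CRON_LINE.match(line)
        if match:
            hhmm, user, cmd = match.groups()
            schedule[(user, cmd)].add(hhmm)
    return schedule


def only_at(schedule, hhmm):
    """Return the jobs whose only observed run time is the given minute."""
    return sorted(job for job, minutes in schedule.items() if minutes == {hhmm})


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python cron_minutes.py /var/log/cron 21:30
    target = sys.argv[2] if len(sys.argv) > 2 else "21:30"
    with open(sys.argv[1]) as fh:
        for user, cmd in only_at(minutes_by_job(fh), target):
            print(user, cmd)
```

Fed a full day of /var/log/cron, anything that runs every minute drops out and only jobs scheduled solely for 21:30 remain. An empty result would support the idea that the trigger is outside cron on this host.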

I'm currently chasing our ESX and storage groups as well for some insight.

Appreciate any advice

regards

Lincoln
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Load spike question

Post by slansing »

Is it taking you about that long to get through Apply Configuration? Is someone pulling a report from over a year ago? What are the hardware specs of the XI server? And is it on a SAN or NAS?
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Re: Load spike question

Post by lance »

Hi,

Nah, we're not doing any config changes or reporting at that time of night (we've got NDOUtils state history retained for 30 days, and the rest for under a week), and when we do apply a config it takes about 10-15 seconds.

All the machines are VMs, with the virtual disk hosted on a SAN.

Hardware specs:

VMware ESX5i
RHEL6 64-bit
4 x CPU: Intel(R) Xeon(R) CPU E7- 2870 @ 2.40GHz
4 GB memory
Loads of disk

As a troubleshooting measure we moved one of the VMs off the SAN disk onto the local host disk, but unfortunately we saw the same behaviour:
spike2.png

thanks

Lincoln
lmiltchev
Former Nagios Staff
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Load spike question

Post by lmiltchev »

Is there a possibility that something else runs around the same time (21:30), such as backups?
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Re: Load spike question

Post by lance »

Hi,

I'm told that backups do indeed start at that time across the SAN, and we're following that up with the relevant support teams. We're keeping an eye on the virtual guest that we moved to the local disk of an ESX host to see whether we get the same symptoms off the storage infrastructure.

I just wanted to confirm whether there were any Nagios-specific tasks kicking off at that time, which there don't seem to be.

Will keep you posted!
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Load spike question

Post by scottwilkerson »

There isn't, but what you describe would be expected if there was high I/O wait across the SAN at that time, as Nagios XI needs quite a bit of I/O to keep running smoothly.
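To put a number on the I/O wait for the storage team, the share of CPU time spent in iowait can be derived from two samples of the aggregate "cpu" line in /proc/stat, where the fifth counter is iowait. This is a minimal sketch, assuming Python 3 on Linux. The function names and the 1-second sampling interval are illustrative choices, not anything Nagios XI provides.

```python
import os
import time


def cpu_fields(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat as integers."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(value) for value in parts[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat reads."""
    old, new = cpu_fields(before), cpu_fields(after)
    deltas = [n - o for o, n in zip(old, new)]
    # Field order: user nice system idle iowait irq softirq steal [guest ...].
    # Guest time is already included in user/nice, so only sum the first 8.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_stat(path="/proc/stat"):
    with open(path) as fh:
        return fh.read()


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_stat()
    time.sleep(1.0)  # sampling interval; an arbitrary choice
    print("iowait: %.1f%%" % iowait_percent(first, read_stat()))
```

Logging this once a minute from 21:25 to 21:40 on both a SAN-backed guest and the one moved to local disk would give directly comparable figures.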
lance
Posts: 38
Joined: Wed Feb 17, 2010 5:00 pm

Re: Load spike question

Post by lance »

That's quite handy info, thanks for that! It's something I can take back to the Storage/ESX teams to help us along.

regards

Lincoln
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Load spike question

Post by slansing »

Excellent, let us know if you have further questions.