Could be right, but the fact that mod_gearman itself has stopped the 25+/25+/25+ load spikes is a pretty good workaround until the main load-spike problem is resolved. Perl scripts, like everything else, add more processing and more time waiting for and on the CPU. I've just moved all my checks from the native Nagios queue to mod_gearman (except localhost; _WORKER=local)
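For anyone wanting to do the same, the relevant piece is the NEB module config. This is a sketch from memory, so double-check the option names against the mod_gearman docs; the path and port are just the common defaults, not necessarily yours:

```ini
# /etc/mod_gearman/mod_gearman_neb.conf -- sketch, not my exact file
server=localhost:4730
hosts=yes
services=yes
eventhandler=yes
# route checks to queues chosen by the _WORKER custom variable;
# a host/service defined with _WORKER=local is executed by the
# Nagios core itself instead of going through gearman
queue_custom_variable=WORKER
```

Then localhost's host definition just carries `_WORKER local` as a custom variable.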
I also re-read this thread in full and changed the nom_checkpoint_interval parameter from 1440 to 90 as instructed.
I've lowered the httpd.conf defaults (RHEL 6.5): 1.5G at most available, with 20 children
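Something along these lines in httpd.conf (a sketch; the exact numbers depend on how much RAM each Apache child actually uses on your box, so size MaxClients to fit your spare memory):

```apache
# prefork MPM scaled down for ~1.5G of spare RAM (Apache 2.2 on RHEL 6)
<IfModule prefork.c>
    StartServers         5
    MinSpareServers      5
    MaxSpareServers     10
    ServerLimit         20
    MaxClients          20
    MaxRequestsPerChild 4000
</IfModule>
```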
I've enabled memcached (128 MB cache with a 5-second retention)
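On RHEL 6 the cache size is set in /etc/sysconfig/memcached (the 5-second retention is set on the application side as the key TTL, not here; these values are mine, adjust to taste):

```sh
# /etc/sysconfig/memcached -- sketch
PORT="11211"
USER="memcached"
MAXCONN="1024"
CACHESIZE="128"
OPTIONS="-l 127.0.0.1"
```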
I've run a full repair on both the nagios and nagiosql databases (online, and once offline when I implemented mod_gearman during the weekend's change window)
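The online pass was just mysqlcheck (credentials and database names are whatever yours are; on MyISAM tables this is safe to run live):

```shell
# repair both databases while MySQL is running
mysqlcheck --repair -u root -p nagios
mysqlcheck --repair -u root -p nagiosql
```

For the offline pass, stop nagios and mysqld first and run the repair against the stopped instance.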
I've cleaned up all configuration files, including adding a dummy contact that emails no one on certain info checks, so that when Nagios restarts it produces no errors or warnings of any type. No dups either; clean output
We run 664 host checks and 3349 service checks, using just about every script available, including some that I have written or modified to fit my environment. I have WAN packet dropping (another issue the network team has been working on with IPS for months now), for which I've written wrapper scripts to cope with the SNMP no-data returns... there is a lot going on in this single system, and this upgrade-to-spikes was not in my planning, or I would have waited, as some would today (if they knew)
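The wrapper idea is simple enough to sketch. This is a hypothetical reconstruction, not my actual script: retry the real check a few times and return UNKNOWN instead of a hard failure when SNMP just comes back empty on the lossy WAN link. `wrap_check` is my name for it; any check command goes after it:

```shell
# wrap_check: run a check command, retry on empty output,
# and fall back to UNKNOWN (exit 3) instead of flapping CRITICAL
wrap_check() {
    RETRIES=3
    for i in $(seq 1 $RETRIES); do
        OUT=$("$@" 2>&1)
        RC=$?
        # any real output -> trust the plugin, pass it straight through
        if [ -n "$OUT" ]; then
            echo "$OUT"
            return $RC
        fi
        sleep 1
    done
    echo "UNKNOWN - no data after $RETRIES tries (WAN packet loss?)"
    return 3
}
```

The return codes follow the standard Nagios plugin convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN), so Nagios treats a silent SNMP agent as UNKNOWN rather than paging anyone.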
We also have the mk_livestatus broker_module for NagVis, check_iftraff.pl (modified) for bandwidth measurements, and now mod_gearman for Nagios 4 (which has yet another bug, too: "\n")
The main system resides in an HA vSphere/vCenter 5.5 environment, with 2 vCPU (although it has never used more than 1) and 4G RAM... vmstat reports 3.3G in use... I also pushed the 400 MB of swap back into memory over the weekend and saw some I/O improvements (swapoff -a && swapon -a)
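For anyone repeating this: it only works if you actually have the free RAM to absorb the swapped pages, so check first (a sketch; needs root):

```shell
free -m                  # make sure free memory exceeds swap used
swapoff -a && swapon -a  # pull pages back into RAM, then re-enable swap
vmstat 1 5               # watch the si/so columns settle back to 0
```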
One other item to point out... when I upgraded, I upgraded the RH packages and kernel first, rebooted, then upgraded XI... the system has remained online since, and in the past it has been flawless without a second reboot
kernel = 2.6.32-431.17.1.el6.x86_64 #1 SMP
and of course the errata for this kernel:
https://rhn.redhat.com/errata/RHSA-2014-0475.html
Not sure anymore... I just need a Nagios XI system that can run at a level pace, and mod_gearman has at least reduced the really high spikes.
ADDED:
The real kicker is the fact that I run a personal free XI on my home hypervisor for my development and some external checks of client sites I also work for (and run XI internally for them)... nothing spikes there on CentOS 6.5. It's 1 vCPU, 2G RAM, with 7 hosts and 350 service checks.
kernel 2.6.32-431.11.2.el6.x86_64 #1 SMP
Anyone out there running 2014 on CentOS with this type of environment setup and no load spikes? Or does this spike headache hit everyone on 2014 (RH or CentOS)?