ndo2db Hogging ALL the CPU

Post by **mrochelle** » Mon Sep 15, 2014 1:13 pm

I have a very similar problem and am curious what resolution was found for it.
Marcus

Post by **sreinhardt** » Mon Sep 15, 2014 3:57 pm

A couple of the causes that people have found are:

- Giant tables for alerts or logs. Truncating the tables should resolve this.
- Stalking might be turned on for hosts/services (fairly uncommon). This generates tons of logs and alerts, quickly causing issues with both Nagios and ndo. Removing stalking would resolve that.
- Offloading the database is a fix in its own right and often solves the issue entirely.
- One final option would be to offload ndo2db as well. Nagios and ndo can talk just fine over TCP sockets, and this is fully configurable within the ndo2db.cfg and ndomod.cfg files as needed.

Unfortunately this is somewhat unique to each install.

Post by **mikew** » Tue Sep 16, 2014 6:17 am

Update: the problem still exists and is creating great concern over resource usage for the customer.

Just to confirm, stalking is not on. Offloading is of course an option, but I wish we knew the cause, as the customer does not want to offload the database or ndo2db.

Question:
What would cause giant tables for alerts or logs and what would be the specific process to truncate the tables?

Ultimately, I really want to know what has caused this issue as it is not happening with many other installs that I have seen.

Post by **mrochelle** » Tue Sep 16, 2014 7:31 am

I have to agree with Mike: of the 28 Nagios production servers I'm managing, only 2 seem to have this problem. However, let me add that after the last update to 2014R1.4 the frequency of this issue dropped substantially (from 2 or 3 times a week per server to once or twice every 2 weeks). On one server I can see, in the graph of scheduled events over time, the checks slowly drift toward all occurring at once, after which they are automatically reset and spread evenly. However, if conditions are just right during the 1 or 2 times it occurs, the server is not able to recover, and a restart of Nagios with the option "use_retained_scheduling_info=0" will recover it. Based on this experience, I tend to believe there may be a glitch in the code that handles auto scheduling or that spreads the scheduling of checks evenly. (I welcome any recommended tweaks to test this hypothesis.)
I did open a support case on this problem previously, and it was resolved at the time by rolling back to an earlier backup archive.
I will make my system available anytime to any Nagios support personnel should they wish to investigate further.
Marcus

Post by **lmiltchev** » Tue Sep 16, 2014 12:38 pm

We've noticed scheduling issues with some customers who had the check interval set very low (1 or 2 minutes). Checks would get pushed forward (rescheduled), and the last check time would not update. This was not necessarily accompanied by high load, though. We were able to recreate the issue in-house. Our developers are looking into this, but for now, here's what can be done as a "workaround".

1. Make sure that the "auto_rescheduling_window" is set LOWER than the smallest check interval.

For example, if your check interval is 1 minute, you can set "auto_rescheduling_window" in nagios.cfg to 45 seconds.

auto_rescheduling_window=45

2. Make sure that "auto_rescheduling_interval" is set LOWER than "auto_rescheduling_window". For example:

auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45

This may fix the rescheduling issues in Nagios Core 4 when the check interval is set low.

These issues may or may not be related but I would appreciate any feedback from people who tried this.
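For anyone who wants to test their own config against the two rules above, here is a minimal Python sketch. It is not a Nagios tool: the function names `parse_nagios_cfg` and `check_rescheduling` are made up for illustration, and it assumes you already know your smallest check interval in seconds and pass it in yourself.

```python
def parse_nagios_cfg(text):
    """Collect key=value pairs from nagios.cfg-style text, skipping comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_rescheduling(settings, smallest_check_interval_s):
    """Return a list of rule violations; an empty list means both rules hold.

    smallest_check_interval_s is supplied by the caller (hypothetical input,
    in seconds); this sketch does not read it from host/service definitions.
    """
    if settings.get("auto_reschedule_checks") != "1":
        return []  # auto-rescheduling is off, so the rules do not apply

    problems = []
    for key in ("auto_rescheduling_interval", "auto_rescheduling_window"):
        if key not in settings:
            problems.append(f"{key} is not set explicitly")
    if problems:
        return problems

    interval = int(settings["auto_rescheduling_interval"])
    window = int(settings["auto_rescheduling_window"])

    # Rule 1: the window must be lower than the smallest check interval.
    if window >= smallest_check_interval_s:
        problems.append(
            f"auto_rescheduling_window={window} is not lower than the "
            f"smallest check interval ({smallest_check_interval_s}s)"
        )
    # Rule 2: the interval must be lower than the window.
    if interval >= window:
        problems.append(
            f"auto_rescheduling_interval={interval} is not lower than "
            f"auto_rescheduling_window={window}"
        )
    return problems


EXAMPLE = """
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
"""

if __name__ == "__main__":
    # 60 = a 1-minute check interval, as in the example above.
    for problem in check_rescheduling(parse_nagios_cfg(EXAMPLE), 60):
        print("WARNING:", problem)
```

With the example values and a 1-minute check interval, the list comes back empty. To check a real install, read your own nagios.cfg into `parse_nagios_cfg` instead of `EXAMPLE`.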

Post by **mrochelle** » Tue Sep 16, 2014 1:15 pm

Thanks for the feedback. I have made the changes and will keep you posted. Also, while load was not really a problem in my experience, I did limit the checks with intervals below 5 minutes to a very small percentage and saw a noticeable decrease in load.
Marcus

Post by **mikew** » Tue Sep 16, 2014 1:18 pm

I reviewed the auto rescheduling settings and they are all well under the lowest check interval. The lowest is 5 min (300) and the auto_rescheduling_interval is 180.

So that does not look like a fit.

Post by **abrist** » Tue Sep 16, 2014 4:52 pm

Mike, do you still need help truncating tables or at least identifying if they have grown too large?

Post by **lmiltchev** » Tue Sep 16, 2014 4:58 pm

I guess this is a totally separate issue then.

Mike, you meant:

auto_rescheduling_window=180

not

auto_rescheduling_interval=180

correct?

Can you show us these three lines?

auto_reschedule_checks=
auto_rescheduling_interval=
auto_rescheduling_window=

Post by **mikew** » Wed Sep 17, 2014 12:50 pm

auto_reschedule_checks=1
auto_rescheduling_interval=45
auto_rescheduling_window=180

Yes, I would be interested to see whether I can truncate the tables, or at least what I could test to see whether that is the issue.

Nagios Support Forum
