Re: [Nagios-devel] Nagios 3.1.1 eats cpu like mad

Hiren Patel wrote:
> Ricardo Maraschini wrote:
>> I couldn't simulate the problem with a static configuration, so let me
>> explain how I simulate it by changing the timeperiod
>> configuration:
>>
>> 0. Create a service with active checks enabled scheduled to check
>> every 5 minutes
>>
>> 1. Associate this service with a timeperiod (initially it can be 24x7)
>>
>> 2. Wait until the service check and reschedule occur.
>> Let's say that the check occurs at 10:00AM and the next check gets
>> scheduled for 10:05AM.
>>
>> 3. Stop nagios
>>
>> 4. Change your timeperiod configuration to invalidate the next service
>> check:
>> Using the above example, you change the service timeperiod
>> configuration to check only from 10:07AM to 24:00. The important thing
>> for simulating the problem is that the following scheduled service
>> check (10:10AM) remains valid.
>>
>> 5. Start nagios
>>
>> 6. Wait until the previously scheduled service check (10:05AM) comes due.
>>
>> The behaviour will change according to your nagios version. On previous
>> versions the service is scheduled to next year; on the latest stable
>> release it is scheduled to next week and a message is printed in the
>> log files.
>>
>> Below you can see an email I sent on April 2nd about the same
>> issue; it may be useful.
>> Good luck, if you need any other info, please let me know.
>>
>
> thank you kindly for the explanation above on how to simulate the issue,
> I was able to simulate it using exactly the steps you mentioned.
> for me the problem is again the function that gets the next valid time,
> it returns void so there's no chance of getting an error return value
> from it, but it also sets the next valid time to the preferred time on
> two conditions, one being the preferred time is valid, the other being
> it can't find a good next valid time. I think this function needs to
> return int, and either OK or ERROR separating the two conditions above.
> in any case, the changes you suggested were problematic in one way, the
> run_async_service_check function can return error on a few occasions,
> not limited to the time being invalid. one such condition could be
> dependency constraints, now if we used current_time to get the next
> valid time for such a case, it would return current_time right back, so
> nagios will schedule that check right away, and when run again, loop in
> the same manner over and over. this I suspect caused the cpu eating seen
> with that diff.
> please test the attached diff if you don't mind. anyone else with
> better/bigger test environments than me could also try this, to see that
> it does not eat cpu like it was.
> I'd consider this a workaround; the function itself should be fixed long term.
>
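
Two points in the quoted message above can be illustrated with a small sketch: a next-valid-time lookup that reports a status instead of silently falling back to the preferred time, and the immediate-reschedule loop that appears when the lookup starts from the current time. This is not Nagios code and not the attached diff. Every name here (next_valid_time, next_valid_time_void_style, reschedule_after_failure, OK, ERROR) is hypothetical, and a timeperiod is simplified to a list of [start, end) ranges in minutes since midnight within a single day.

```python
OK, ERROR = 0, -1  # hypothetical status codes, for illustration only


def next_valid_time(preferred, ranges):
    """Return (OK, t) with the earliest valid minute t >= preferred,
    or (ERROR, None) when no such minute exists in the given ranges."""
    best = None
    for start, end in ranges:  # each range is [start, end) in minutes
        if start <= preferred < end:
            return OK, preferred  # condition 1: the preferred time is valid
        if start > preferred and (best is None or start < best):
            best = start
    if best is None:
        return ERROR, None  # condition 2: nothing valid was found
    return OK, best


def next_valid_time_void_style(preferred, ranges):
    """Mimic a function with no return status: on failure it falls back to
    the preferred time, so the caller cannot tell the two conditions apart."""
    status, t = next_valid_time(preferred, ranges)
    return t if status == OK else preferred


def reschedule_after_failure(now, interval, ranges):
    """One possible guard (an assumption, not the posted fix): search from
    now + interval so a check that failed for a non-timeperiod reason is
    never handed back the current time."""
    status, t = next_valid_time(now + interval, ranges)
    return t if status == OK else None


ALWAYS = [(0, 24 * 60)]  # simplified 24x7 timeperiod
now = 10 * 60 + 5        # 10:05 as minutes since midnight

# Searching from `now` inside a valid timeperiod returns `now` itself,
# which is the "schedule right away, fail again, repeat" situation.
naive_status, naive_time = next_valid_time(now, ALWAYS)

# Searching from now + interval moves the check forward instead.
guarded_time = reschedule_after_failure(now, 5, ALWAYS)
```

With the void-style variant, an empty timeperiod and a 24x7 timeperiod both yield the preferred time, which is the ambiguity described above; the status-returning variant separates them.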

I replicated the bug and have just posted a fix to CVS. The logic was
bad, either because of recent timeperiod check logic changes or ever
since the 3.x check logic redesign.

I wasn't able to replicate any CPU hogging, so I'm not sure if that is a
separate issue that needs to be fixed elsewhere.


- Ethan Galstad
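
For readers who want to follow the reproduction scenario quoted earlier in the thread, here is a minimal sketch of its timeline. It only models the times from the example (10:00, 10:05, 10:07, 10:10) and what a reader might expect the next valid time to be; it does not reproduce what any Nagios version actually does. The date is arbitrary and the helper names (at, in_window, earliest_valid) are hypothetical.

```python
from datetime import datetime, timedelta

DAY = datetime(2009, 1, 1)  # arbitrary date, chosen only for illustration


def at(hour, minute):
    """Build a timestamp on the example day (24:00 becomes next midnight)."""
    return DAY + timedelta(hours=hour, minutes=minute)


def in_window(t, window):
    """True when t falls inside the half-open window [start, end)."""
    start, end = window
    return start <= t < end


def earliest_valid(t, window):
    """Earliest time >= t inside the window, or None if the window is over."""
    start, end = window
    candidate = max(t, start)
    return candidate if candidate < end else None


interval = timedelta(minutes=5)
last_check = at(10, 0)                 # step 2: check ran at 10:00
pending = last_check + interval        # next check scheduled for 10:05
following = pending + interval         # the check after that, 10:10

old_window = (at(0, 0), at(24, 0))     # step 1: 24x7
new_window = (at(10, 7), at(24, 0))    # step 4: 10:07 to 24:00

# After the restart, the pending 10:05 check is outside the new window,
# while the following 10:10 check would still be valid.
pending_ok_before = in_window(pending, old_window)
pending_ok_after = in_window(pending, new_window)
following_ok_after = in_window(following, new_window)

# What one might expect as the rescheduled time under these assumptions.
expected_next = earliest_valid(pending, new_window)
```

Under these assumptions the expected reschedule is 10:07 on the same day, in contrast to the next-year or next-week scheduling reported in the thread.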





This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]