Flap and Retain status issues with the service

Posted: Fri Jan 20, 2017 9:37 am
by dlukinski
Hello XI Support

We have a situation where XI retains a FAIL status (in the console) for hours, until we manually force an immediate re-check, at which point the status goes to OK right away.
The problem is that during those hours the checks are actually passing (visibly, in Selenium & Firefox).

- Checks are ACTIVE and run every 20 minutes, with 2 retries every 10 minutes.
- Flap set to skip
- Retain set to skip
- Obsess set to skip

All of the checks in question are check_selenium checks.

Is there anything we can do with the Flap/Retain/Obsess settings to make sure XI does not get stuck with an event code from many hours ago?
Profile attached (zip)

Service status picture attached. Essentially, if I re-run this one with "force immediate check", the status changes to OK (even though the check already runs OK every 20 minutes).

EDIT: profile removed as it may contain sensitive data.

Re: Flap and Retain status issues with the service

Posted: Fri Jan 20, 2017 2:22 pm
by tgriep
For a quick test, try changing the Retry Interval to 9 minutes and see if that makes the issue go away.
Post back your findings after making that change.

Re: Flap and Retain status issues with the service

Posted: Sun Jan 22, 2017 4:55 pm
by dlukinski
tgriep wrote:For a quick test, try changing the Retry Interval to 9 minutes and see if that makes the issue go away.
Post back your findings after making that change.

This is set via a template, so I changed it there.

No, not resolved; the same problem is happening again.

Re: Flap and Retain status issues with the service

Posted: Mon Jan 23, 2017 11:34 am
by tgriep
Thanks for posting back.
I noticed that you are running Mod Gearman on the XI system. If the Gearman server is sending the scheduled check to a worker that cannot run it, while a forced check gets sent to the correct worker, that could cause the issue you are seeing.
Can you verify that the Gearman server / workers are set up correctly for this service?

If the settings are correct, I would need to see the following files from the XI server the next time it happens, as well as the Gearman server and worker configuration files.

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Thanks
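
Before sending the files over, it can help to look at what XI currently holds for the stuck service. The following is only a rough sketch, assuming the usual status.dat layout of `servicestatus { key=value ... }` blocks; the host name, service name and values in `SAMPLE` are made up for illustration, and the field names used (`current_state`, `state_type`, `last_check`, `plugin_output`) should be checked against your own file.

```python
from datetime import datetime, timezone

STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Hypothetical excerpt in the status.dat block layout (not real data).
SAMPLE = """\
info {
\tcreated=1485268400
\t}

servicestatus {
\thost_name=web01
\tservice_description=Selenium login
\tcurrent_state=2
\tstate_type=1
\tlast_check=1485268380
\tplugin_output=ERROR Server Exception: sessionId should not be null
\t}
"""


def parse_blocks(text, block_type="servicestatus"):
    """Return one dict per `block_type { ... }` block found in the text."""
    blocks = []
    current = None  # dict being filled while inside a matching block
    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            if line == block_type + " {":
                current = {}
        elif line == "}":
            blocks.append(current)
            current = None
        elif "=" in line:
            # Split on the first "=" only; plugin output may contain more.
            key, _, value = line.partition("=")
            current[key] = value
    return blocks


def describe(service):
    """Format the stored state and the time of the last check (UTC)."""
    state = STATE_NAMES.get(int(service["current_state"]), "?")
    kind = "HARD" if service.get("state_type") == "1" else "SOFT"
    last = datetime.fromtimestamp(int(service["last_check"]), tz=timezone.utc)
    return (
        f'{service["host_name"]} / {service["service_description"]}: '
        f"{state} ({kind}), last check {last:%Y-%m-%d %H:%M} UTC, "
        f'output: {service.get("plugin_output", "")}'
    )
```

To use it on the real file, read `/usr/local/nagios/var/status.dat` into a string and pass it to `parse_blocks`. If `last_check` for the service is hours old while the check is supposedly running every 20 minutes, that points at scheduling rather than at the plugin result.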

Re: Flap and Retain status issues with the service

Posted: Tue Jan 24, 2017 2:33 pm
by dlukinski
tgriep wrote:Thanks for posting back.
I noticed that you are running Mod Gearman on the XI system. If the Gearman server is sending the scheduled check to a worker that cannot run it, while a forced check gets sent to the correct worker, that could cause the issue you are seeing.
Can you verify that the Gearman server / workers are set up correctly for this service?

If the settings are correct, I would need to see the following files from the XI server the next time it happens, as well as the Gearman server and worker configuration files.

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Thanks

All of these are excluded from Gearman; everything runs locally on XI because these are check_selenium services.
I wonder whether check_selenium misinterprets RC 2.53 session results?

Re: Flap and Retain status issues with the service

Posted: Tue Jan 24, 2017 3:32 pm
by tgriep
It is hard to say if the plugin misinterprets the result.
Are you running the latest versions of the Selenium software?
http://devops-abyss.blogspot.com/2010/0 ... agios.html

Re: Flap and Retain status issues with the service

Posted: Tue Jan 24, 2017 3:52 pm
by dlukinski
tgriep wrote:It is hard to say if the plugin misinterprets the result.
Are you running the latest versions of the Selenium software?
http://devops-abyss.blogspot.com/2010/0 ... agios.html

The Nagios integration cannot run the latest version, 3.0.1 (it is missing a 3rd-party CPAN library), but we are on the latest for RC, 2.53.1.

I wonder whether check_selenium interprets "ERROR Server Exception: sessionId should not be null: has this session been started yet?" as CRITICAL, when in fact it should be UNKNOWN (how can that be changed?)
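
One way to get that behaviour without touching the plugin itself would be a small wrapper that runs the real check and remaps the exit code. This is only a sketch under assumptions: it relies on the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and on the session error text appearing in the plugin output; the wrapper and its function names are hypothetical, not part of check_selenium or XI.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Text that marks a Selenium session problem rather than a failed test.
SESSION_ERROR = "sessionId should not be null"


def remap(exit_code, output):
    """Turn CRITICAL into UNKNOWN when the output shows a session error."""
    if exit_code == CRITICAL and SESSION_ERROR in output:
        return UNKNOWN
    return exit_code


def main(argv):
    """Run the wrapped plugin command, echo its output, return the exit code."""
    if not argv:
        print("UNKNOWN: usage: <wrapper> <plugin> [plugin args...]")
        return UNKNOWN
    try:
        result = subprocess.run(argv, capture_output=True, text=True)
    except OSError as exc:
        print(f"UNKNOWN: could not run plugin: {exc}")
        return UNKNOWN
    output = (result.stdout + result.stderr).strip()
    print(output)
    return remap(result.returncode, output)
```

Saved as a script, it would end with `import sys` and `sys.exit(main(sys.argv[1:]))`, and the service's check command would call the wrapper with the original check_selenium command line as its arguments. Note that an UNKNOWN result is still a non-OK state, so this changes how the failure is classified, not whether the status gets stuck.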

Re: Flap and Retain status issues with the service

Posted: Tue Jan 24, 2017 4:40 pm
by tgriep
You would have to ask the developer of the plugin for those changes.

Re: Flap and Retain status issues with the service

Posted: Wed Jan 25, 2017 3:42 pm
by dlukinski
tgriep wrote:You would have to ask the developer of the plugin for those changes.

You are the developer :-) (Nagios team that is)

Re: Flap and Retain status issues with the service

Posted: Wed Jan 25, 2017 3:53 pm
by tgriep
I was talking about the plugin changes, but if you want me to look into why that check works when you force it, I will need to see the following files.

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
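
For a first pass over nagios.log, the state changes of a single service can be pulled out with a few lines of Python. This is a sketch, assuming the common `[epoch] SERVICE ALERT: host;service;STATE;TYPE;attempt;output` line format; the host, service and log lines in `SAMPLE_LOG` are invented for illustration.

```python
import re
from datetime import datetime, timezone

# Matches lines such as:
# [1485268380] SERVICE ALERT: host;service;CRITICAL;SOFT;1;plugin output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]*);(?P<service>[^;]*);(?P<state>[^;]*);"
    r"(?P<state_type>[^;]*);(?P<attempt>\d+);(?P<output>.*)$"
)

# Hypothetical log lines (not real data).
SAMPLE_LOG = [
    "[1485268380] SERVICE ALERT: web01;Selenium login;CRITICAL;SOFT;1;ERROR Server Exception: sessionId should not be null",
    "[1485268980] SERVICE ALERT: web01;Selenium login;CRITICAL;HARD;3;ERROR Server Exception: sessionId should not be null",
    "[1485269000] SERVICE ALERT: web01;Other check;OK;HARD;1;fine",
    "[1485290000] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;web01;Selenium login;1485290000",
    "[1485290060] SERVICE ALERT: web01;Selenium login;OK;HARD;1;OK - test passed",
]


def service_alerts(lines, host, service):
    """Return (time, state, state_type, attempt, output) per matching alert."""
    events = []
    for line in lines:
        m = ALERT_RE.match(line.strip())
        if m and m["host"] == host and m["service"] == service:
            when = datetime.fromtimestamp(int(m["ts"]), tz=timezone.utc)
            events.append(
                (when, m["state"], m["state_type"], int(m["attempt"]), m["output"])
            )
    return events
```

On the real log, pass the open file object as `lines`. A long gap between the HARD CRITICAL alert and the next OK alert, with the OK arriving right after a forced check, would match the behaviour described in this thread.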