Flap and Retain status issues with the service
Hello XI Support
We are seeing XI retain a FAIL status in the console for hours, until we manually force an immediate check, at which point it goes to OK right away.
The problem is that during these hours the checks are visibly passing (Selenium and Firefox).
- Checks are ACTIVE and run every 20 minutes, with 2 retries every 10 minutes.
- Flap set to skip
- Retain set to skip
- Obsess set to skip
All checks in question are check_selenium cases.
Anything we can do around Flap/Retain/Obsess to make sure XI does not get stuck with the event code from many hours ago?
Profile attached (zip)
Service status screenshot attached. Essentially, if I rerun this one with "force immediate check", the status changes to OK (even though the check already runs OK every 20 minutes).
EDIT: profile removed as it may contain sensitive data.
Re: Flap and Retain status issues with the service
For a quick test, try changing the Retry Interval to 9 minutes and see if that makes the issue go away.
Post back your findings after making that change.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Flap and Retain status issues with the service
tgriep wrote: For a quick test, try changing the Retry Interval to 9 minutes and see if that makes the issue go away.
Post back your findings after making that change.
This is set via a template, so I changed it there.
No, not resolved; the same problem is happening again.
Re: Flap and Retain status issues with the service
Thanks for posting back.
I noticed that you are running Mod Gearman on the XI system. If the Gearman server sends the scheduled check to a worker that cannot run it, while a forced check is sent to the correct worker, that could cause the issue you are seeing.
Can you verify that the Gearman server and workers are set up correctly for this service?
If the settings are correct, I would need to see the following files from the XI server the next time it happens, as well as the Gearman server and worker configuration files.
Thanks
Code: Select all
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
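When reviewing nagios.log for a service that appears stuck, the entries of interest are usually the SERVICE ALERT lines, which Nagios Core writes on state changes in the form `[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output`. The following is a minimal Python sketch for pulling those lines out for one service. The function name `service_alerts` and the host and service names used in the example are hypothetical.

```python
import re
from datetime import datetime, timezone

# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);([^;]*);(\d+);(.*)$"
)


def service_alerts(lines, service):
    """Return the SERVICE ALERT entries for one service description, in log order."""
    alerts = []
    for line in lines:
        match = ALERT_RE.match(line.rstrip("\n"))
        if match is None or match.group(3) != service:
            continue
        ts, host, _svc, state, state_type, attempt, output = match.groups()
        alerts.append({
            "time": datetime.fromtimestamp(int(ts), tz=timezone.utc),
            "host": host,
            "state": state,            # OK, WARNING, CRITICAL or UNKNOWN
            "type": state_type,        # SOFT or HARD
            "attempt": int(attempt),
            "output": output,
        })
    return alerts
```

For example, `service_alerts(open("/usr/local/nagios/var/nagios.log"), "login")` would list every logged state change for a service named "login". A long gap between a HARD CRITICAL entry and the next OK entry would match the symptom described in this thread.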
Re: Flap and Retain status issues with the service
tgriep wrote: Thanks for posting back.
I noticed that you are running Mod Gearman on the XI system. If the Gearman server sends the scheduled check to a worker that cannot run it, while a forced check is sent to the correct worker, that could cause the issue you are seeing.
Can you verify that the Gearman server and workers are set up correctly for this service?
If the settings are correct, I would need to see the following files from the XI server the next time it happens, as well as the Gearman server and worker configuration files.
Thanks
Code: Select all
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
All of these are excluded from Gearman; they all run locally on XI because they are check_selenium services.
I wonder whether check_selenium misinterprets RC 2.53 session results?
Re: Flap and Retain status issues with the service
It is hard to say if the plugin misinterprets the result.
Are you running the latest versions of the Selenium software?
http://devops-abyss.blogspot.com/2010/0 ... agios.html
Re: Flap and Retain status issues with the service
tgriep wrote: It is hard to say if the plugin misinterprets the result.
Are you running the latest versions of the Selenium software?
http://devops-abyss.blogspot.com/2010/0 ... agios.html
The Nagios integration cannot run the latest version, 3.0.1 (missing third-party CPAN library), but for RC we are on the latest, 2.53.1.
I wonder whether check_selenium interprets "ERROR Server Exception: sessionId should not be null: has this session been started yet?" as CRITICAL, when in fact it should be UNKNOWN. How can that be changed?
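One way to approach the "how can that be changed?" question without modifying the plugin itself is a small wrapper that runs the plugin and remaps its exit code. The Python sketch below is an assumption and not part of check_selenium: the function names `remap` and `main` and the choice to match on the error text quoted above are hypothetical. It relies only on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Error text quoted in this thread; treated here as a session setup problem
# rather than a failure of the monitored site (an assumption).
SESSION_ERROR = "sessionId should not be null"


def remap(exit_code, output):
    """Return UNKNOWN when a non-OK result carries the session error."""
    if exit_code != OK and SESSION_ERROR in output:
        return UNKNOWN
    return exit_code


def main(argv):
    """Run the plugin command given in argv, echo its output, return the remapped code."""
    if not argv:
        print("UNKNOWN: no plugin command given")
        return UNKNOWN
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except OSError as exc:
        print(f"UNKNOWN: could not run {argv[0]}: {exc}")
        return UNKNOWN
    output = proc.stdout + proc.stderr
    sys.stdout.write(output)
    return remap(proc.returncode, output)

# Entry point when saved as a script:
#     sys.exit(main(sys.argv[1:]))
```

The service's check command would then call the wrapper with the original plugin command line as its arguments. Whether UNKNOWN is the right state for this error is a judgment call for each site, and the wrapper does not explain why XI keeps the old state.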
Re: Flap and Retain status issues with the service
You would have to ask the developer of the plugin for those changes.
Re: Flap and Retain status issues with the service
tgriep wrote: You would have to ask the developer of the plugin for those changes.
You are the developer
Re: Flap and Retain status issues with the service
I was talking about the plugin changes, but if you want me to look into why that check works when you force it, I will need to see the following files.
Code: Select all
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
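status.dat stores one `servicestatus { ... }` block per service, made up of `key=value` lines such as `current_state` and `last_check`. As a rough sketch of what to look for in that file, the Python below lists services that are not OK and whose last recorded check is older than a chosen age, which would suggest the scheduled results are not reaching Nagios. The function names, the threshold, and the sample host and service names are hypothetical.

```python
import time


def parse_status_blocks(text, block_type="servicestatus"):
    """Yield one dict of key/value strings per `block_type { ... }` block."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            if line == block_type + " {":
                current = {}
        elif line == "}":
            yield current
            current = None
        elif "=" in line:
            # Split on the first "=" only; plugin output may contain more.
            key, _, value = line.partition("=")
            current[key] = value


def stale_services(text, max_age_seconds, now=None):
    """Return (host, service, state, age) for non-OK services with an old last_check."""
    now = time.time() if now is None else now
    stale = []
    for svc in parse_status_blocks(text):
        age = int(now - int(svc.get("last_check", "0")))
        if svc.get("current_state") != "0" and age > max_age_seconds:
            stale.append(
                (svc.get("host_name"), svc.get("service_description"),
                 svc.get("current_state"), age)
            )
    return stale
```

With a 20-minute check interval, calling `stale_services(open("/usr/local/nagios/var/status.dat").read(), 1200)` while the service shows FAIL would indicate whether `last_check` is still advancing for it.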