Flap and Retain status issues with the service

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
dlukinski
Posts: 1130
Joined: Tue Oct 06, 2015 9:42 am

Flap and Retain status issues with the service

Post by dlukinski »

Hello XI Support

We are seeing a situation where XI retains a FAIL status (in the console) for hours, until we manually force an immediate re-check, at which point it goes OK right away.
The problem is that during those hours the scheduled checks are also OK (visibly, in Selenium & Firefox).

- Checks are ACTIVE and run every 20 minutes, with 2 retries at 10-minute intervals.
- Flap set to skip
- Retain set to skip
- Obsess set to skip
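
To make the schedule above concrete, here is a minimal sketch of a simplified soft/hard state model. It is not Nagios Core's actual scheduler: the function name `simulate` is hypothetical, and reading "2 retries" as a maximum of 3 check attempts is an assumption.

```python
def simulate(results, check_interval=20, retry_interval=10, max_attempts=3):
    """Simplified model (an assumption, not Nagios source code).

    Returns a list of (minute, result, state_type, attempt) tuples.
    """
    timeline = []
    minute, attempt = 0, 0
    in_hard_problem, prev_ok = False, True
    for result in results:
        if result == "OK":
            # An OK result resets the attempt counter; it is a SOFT
            # recovery only if the problem never reached a HARD state.
            state_type = "HARD" if (prev_ok or in_hard_problem) else "SOFT"
            attempt, in_hard_problem = 1, False
            next_delay = check_interval
        else:
            if prev_ok:
                attempt = 1
            elif not in_hard_problem:
                attempt += 1
            in_hard_problem = in_hard_problem or attempt >= max_attempts
            state_type = "HARD" if in_hard_problem else "SOFT"
            # Retries use the shorter interval until the state goes HARD.
            next_delay = check_interval if in_hard_problem else retry_interval
        timeline.append((minute, result, state_type, attempt))
        minute += next_delay
        prev_ok = result == "OK"
    return timeline


for row in simulate(["OK", "CRITICAL", "CRITICAL", "CRITICAL", "OK"]):
    print(row)
```

In this model any scheduled OK result clears the problem state, so a status that stays FAIL for hours would suggest that the scheduled results are either not OK or not being processed.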

All of the checks in question are check_selenium cases.

Is there anything we can do around Flap/Retain/Obsess to make sure XI does not get stuck with the event code from many hours ago?
Profile attached (zip)

Service status picture attached. Essentially, if I re-run this one with "force immediate check", the status will change to OK (even though it already runs OK every 20 minutes).

EDIT: profile removed as it may contain sensitive data.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Flap and Retain status issues with the service

Post by tgriep »

For a quick test, try changing the Retry Interval to 9 minutes and see if that makes the issue go away.
Post back your findings after making that change.
Be sure to check out our Knowledgebase for helpful articles and solutions!
dlukinski
Posts: 1130
Joined: Tue Oct 06, 2015 9:42 am

Re: Flap and Retain status issues with the service

Post by dlukinski »

tgriep wrote: For a quick test, try changing the Retry Interval to 9 minutes and see if that makes the issue go away.
Post back your findings after making that change.

This is done via template - changed that.

No, not resolved; the same problems are happening again.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Flap and Retain status issues with the service

Post by tgriep »

Thanks for posting back.
I noticed that you are running Mod Gearman on the XI system. If the Gearman server is sending the scheduled check to a worker that cannot run it, while a forced check gets sent to the correct worker, that could cause the issue you are seeing.
Can you verify that the Gearman server and workers are set up correctly for this service?

If the settings are correct, I would have to see the following files from the XI server the next time it happens, as well as the Gearman server and worker configuration files.

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Thanks
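
As a rough aid for reading the first of those files, the sketch below pulls the `SERVICE ALERT` entries for one service out of `nagios.log`, so the logged state changes can be compared with what the console shows. The helper name `service_alerts` and the host and service names in the usage note are hypothetical; the line layout assumed here is the usual `[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output`.

```python
import re

# Assumed layout: [epoch] SERVICE ALERT: host;service;state;type;attempt;output
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);([^;]*);(\d+);(.*)$"
)


def service_alerts(lines, host, service):
    """Yield one dict per SERVICE ALERT line matching host and service."""
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if not match:
            continue  # not a service alert (e.g. a retention or notification line)
        epoch, h, s, state, state_type, attempt, output = match.groups()
        if h == host and s == service:
            yield {
                "epoch": int(epoch),
                "state": state,
                "state_type": state_type,
                "attempt": int(attempt),
                "output": output,
            }
```

Usage would look like `with open("/usr/local/nagios/var/nagios.log") as f: alerts = list(service_alerts(f, "web01", "Login test"))`, where `web01` and `Login test` stand in for the real host and service.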
dlukinski
Posts: 1130
Joined: Tue Oct 06, 2015 9:42 am

Re: Flap and Retain status issues with the service

Post by dlukinski »

tgriep wrote: Thanks for posting back.
I noticed that you are running Mod Gearman on the XI system. If the Gearman server is sending the scheduled check to a worker that cannot run it, while a forced check gets sent to the correct worker, that could cause the issue you are seeing.
Can you verify that the Gearman server and workers are set up correctly for this service?

If the settings are correct, I would have to see the following files from the XI server the next time it happens, as well as the Gearman server and worker configuration files.

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Thanks

All of these are excluded from Gearman; everything runs locally on XI because these are check_selenium services.
I wonder if check_selenium may be misinterpreting RC 2.53 session results?
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Flap and Retain status issues with the service

Post by tgriep »

It is hard to say if the plugin misinterprets the result.
Are you running the latest versions of the Selenium software?
http://devops-abyss.blogspot.com/2010/0 ... agios.html
dlukinski
Posts: 1130
Joined: Tue Oct 06, 2015 9:42 am

Re: Flap and Retain status issues with the service

Post by dlukinski »

tgriep wrote: It is hard to say if the plugin misinterprets the result.
Are you running the latest versions of the Selenium software?
http://devops-abyss.blogspot.com/2010/0 ... agios.html
The Nagios integration cannot run the latest version, 3.0.1 (missing 3rd-party CPAN library), but we are on the latest for RC, 2.53.1.

I wonder whether check_selenium is indeed interpreting "ERROR Server Exception: sessionId should not be null: has this session been started yet?" as CRITICAL, when in fact it should be UNKNOWN. (How can we change that?)
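
One way to get that behaviour without changing the plugin itself would be a small wrapper around it. The sketch below is not part of check_selenium: the names `remap_status` and `main` are hypothetical, and it assumes the plugin prints the error text on stdout (or stderr) and exits with the standard plugin codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Text taken from the error message quoted above.
SESSION_ERROR = "sessionId should not be null"


def remap_status(exit_code, output):
    """Downgrade CRITICAL to UNKNOWN when the session error is present."""
    if exit_code == CRITICAL and SESSION_ERROR in output:
        return UNKNOWN
    return exit_code


def main(argv):
    """Run the wrapped plugin command and return the remapped exit code."""
    if not argv:
        print("UNKNOWN: no plugin command given")
        return UNKNOWN
    proc = subprocess.run(argv, capture_output=True, text=True)
    output = proc.stdout.strip() or proc.stderr.strip()
    print(output)  # pass the plugin's own output through unchanged
    return remap_status(proc.returncode, output)


# As a script, the entry point would be: sys.exit(main(sys.argv[1:]))
```

The check command would then call the wrapper with the original plugin command line as its arguments. Note that this only changes how the result is reported; it does not explain why the session error occurs.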
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Flap and Retain status issues with the service

Post by tgriep »

You would have to ask the developer of the plugin for those changes.
dlukinski
Posts: 1130
Joined: Tue Oct 06, 2015 9:42 am

Re: Flap and Retain status issues with the service

Post by dlukinski »

tgriep wrote: You would have to ask the developer of the plugin for those changes.
You are the developer :-) (Nagios team that is)
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Flap and Retain status issues with the service

Post by tgriep »

I was talking about the plugin changes, but if you want me to look into why that check works when you force it, I will need to see the following files.

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
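
For a quick look at what `status.dat` holds for the affected service while it is stuck, a minimal sketch follows. It assumes the usual layout of `servicestatus { ... }` blocks containing `key=value` lines; the helper names `parse_status_blocks` and `find_service` are hypothetical.

```python
def parse_status_blocks(text, block_type="servicestatus"):
    """Return a list of dicts, one per block of the given type."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            if line == block_type + " {":
                current = {}  # start collecting a new block
        elif line == "}":
            blocks.append(current)
            current = None
        elif "=" in line:
            # Split on the first "=" only; plugin output may contain more.
            key, _, value = line.partition("=")
            current[key] = value
    return blocks


def find_service(text, host, service):
    """Return the status block for one host/service pair, or None."""
    for block in parse_status_blocks(text):
        if (block.get("host_name") == host
                and block.get("service_description") == service):
            return block
    return None
```

Fields worth comparing against the console are `current_state`, `state_type`, `current_attempt`, `last_check`, `next_check` and `plugin_output`. If `last_check` keeps advancing every 20 minutes while `current_state` stays non-zero, the scheduled runs are returning a non-OK result rather than the status simply being stale.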
Locked