tgriep wrote:I was talking about the plugin changes, but if you want me to look into why that check works when you force the check, I will need to see the following files.
Code: Select all
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Added zipped files.
Flap and Retain status issues with the service
Re: Flap and Retain status issues with the service
Last edited by tgriep on Fri Feb 03, 2017 10:44 am, edited 1 time in total.
Reason: Removed nagios.zip
Re: Flap and Retain status issues with the service
All of the settings in those files look good for that service and it looks like the current state of that check is OK.
The next time you see it go to a critical state and not recover, upload the same log files so we can check them when it is failing.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Flap and Retain status issues with the service
tgriep wrote:All of the settings in those files look good for that service and it looks like the current state of that check is OK.
The next time you see it go to a critical state and not recover, upload the same log files so we can check them when it is failing.
Here you go: it is happening again.
Please look at the attached service screenshot (the VM, which runs it in the GUI in Firefox, runs one OK with no issues; the Selenium script starts and ends OK).
Re: Flap and Retain status issues with the service
The screenshot is not attached to the post.
Can you attach it again, as well as the log files if the service check is failing to restart?
These are the files we need to see when the service doesn't run the check when it is down.
Code: Select all
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Re: Flap and Retain status issues with the service
tgriep wrote:The screenshot is not attached to the post.
Can you attach it again, as well as the log files if the service check is failing to restart?
These are the files we need to see when the service doesn't run the check when it is down.
Code: Select all
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm
Re: Flap and Retain status issues with the service
Do the Selenium checks generate a lot of output? What is the output of:
Code: Select all
# /usr/local/nagios/bin/nagios -v
There is a known issue with Core and large outputs that was fixed in XI 5.4.x.
Previous Nagios employee
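To put a number on "a lot of output", one rough option is to run the check by hand and count what it prints. The following is only a Python sketch: the helper name `measure_output` and the 8192-byte figure are assumptions made for illustration, not values taken from this thread, so compare against whatever limit applies to your own build.

```python
import subprocess
import sys

# Assumption: 8192 bytes is only an illustrative threshold; the real limit
# depends on how your Nagios Core / XI build was compiled and configured.
ASSUMED_LIMIT_BYTES = 8192


def measure_output(cmd, limit=ASSUMED_LIMIT_BYTES):
    """Run a check command and report how much it prints (stdout + stderr)."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr, like `2>&1` in a shell
    )
    size = len(result.stdout)
    return {
        "bytes": size,
        "lines": result.stdout.count(b"\n"),
        "over_limit": size > limit,
    }


# Stand-in command so the sketch runs anywhere; replace it with the real
# check, for example ["perl", "selenium-alfresco-2-SGH"].
print(measure_output([sys.executable, "-c", "print('x' * 100)"]))
```

If `over_limit` comes back true for the real check, that would point toward the large-output issue rather than the Selenium test itself.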
Re: Flap and Retain status issues with the service
This thread may be of use. Have we checked the Selenium logs yet?
https://github.com/seleniumhq/selenium- ... ssues/2716
Re: Flap and Retain status issues with the service
If we could also see the outputs from those Selenium checks from the CLI, that might be helpful. I know the IDE's default timeout is pretty low, which might be causing the false negatives.
Former Nagios employee
https://www.mcapra.com/
Re: Flap and Retain status issues with the service
mcapra wrote:If we could also see the outputs from those Selenium checks from the CLI, that might be helpful. I know the IDE's default timeout is pretty low, which might be causing the false negatives.
OK, because these issues come and go, I am going to wait until we upgrade to 5.4.2 and review again after that.
Here is one that just bounced back from the same exception:
Code: Select all
[root@fikc-nagxiprod01 libexec]# perl selenium-alfresco-2-SGH
ok 1 - set_timeout, 60000
ok 2 - open, /production-ui/\#login
ok 3 - wait_for_page_to_load, 120000
ok 4
ok 5 - type, css=input.test-class-component-username, nagximon
ok 6 - type, css=input.test-class-component-password, C0mplexNagMon123
ok 7 - click, css=button.gwt-Button
ok 8
ok 9 - type, css=input.gwt-TextBox, 53202201
ok 10 - click, css=button.gwt-Button
ok 11
ok 12 - click, css=div.GCJ52QVBBI > span > input[type="checkbox"]
ok 13 - click, css=button.GCJ52QVBDL
Key: ALL, Value: 1487608866.80076:1
Start Time: 1487608866.80076
End Time: 1487608885.94119
OK: Processes completed after 19.14 of page loading.|0| ALL=19.14s;30;45;0;60 |1..13
[root@fikc-nagxiprod01 libexec]#
Re: Flap and Retain status issues with the service
I would need to see the full call trace from the RC server to determine whether it's a "false negative" or not. I know check_selenium by itself will throw a "CRITICAL" if it finds the word "ERROR" anywhere in the test case's output, though I believe the RC server will, in many cases, retry the test if it can't establish a session ID on the first attempt. So the first session on these tests might be failing, but a second or third may be succeeding. The logic in check_selenium I'm referring to:
Code: Select all
my $output = `perl $script 2>&1`;
if ( $output =~ m/(ERROR.+\n)/ ) {
    $message = $1;
    $rc = 2; #end it with an error
} elsif ( $output =~ m/OK/ ) {
    my @lines = split(/\n/, $output);
    foreach my $line (@lines) {
        if ( $line =~ m/OK:/ ) {
            ($message, $rc, $performance_msg, $items_ran) = split(/\|/, $line);
            $message = "$message | $performance_msg\n";
        }
    }
} else {
    $message = "UNKNOWN: $output";
    $rc = 3;
}
You might try altering this script to set $message equal to $output, like so, to get better debug information:
Code: Select all
$message = $output;
Though this could cause issues with the status output overflowing, as @avandemore pointed out.
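To make the branching in the check_selenium snippet easier to experiment with, here is a rough Python sketch of the same decision order. It is not the plugin itself: the function name `classify`, the hypothetical "ERROR" line in the usage example, and the fallback to UNKNOWN for a non-numeric return code are all assumptions added for illustration. The summary line is the one from the CLI run posted earlier in the thread.

```python
import re


def classify(output):
    """Mimic the decision order of the quoted check_selenium logic.

    Returns (message, rc) using Nagios return codes:
    0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
    """
    # Any "ERROR..." line wins, even if a later retry succeeded.
    error = re.search(r"(ERROR.+\n)", output)
    if error:
        return error.group(1), 2

    if "OK" in output:
        message, rc = None, None  # stays unset if no "OK:" summary line exists
        for line in output.split("\n"):
            if "OK:" in line:
                # Summary layout: message|rc|perfdata|items_ran
                parts = (line.split("|") + [""] * 4)[:4]
                message = f"{parts[0]} | {parts[2]}\n"
                # Assumption: treat a non-numeric rc field as UNKNOWN (3).
                rc = int(parts[1]) if parts[1].strip().isdigit() else 3
        return message, rc

    return f"UNKNOWN: {output}", 3


summary = ("OK: Processes completed after 19.14 of page loading."
           "|0| ALL=19.14s;30;45;0;60 |1..13\n")
print(classify(summary)[1])
# Hypothetical failed first attempt followed by a successful retry:
print(classify("ERROR: hypothetical session failure\n" + summary)[1])
```

The second call illustrates the point made above: a single "ERROR" line anywhere in the output yields CRITICAL even when the same run ends with an "OK:" summary.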