Re: Flap and Retain status issues with the service
Posted: Fri Feb 03, 2017 9:49 am
by dlukinski
tgriep wrote:I was talking about the plugin changes, but if you want me to look into why that check works when you force the check, I will need to see the following files.
Code:
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Added zipped files
Re: Flap and Retain status issues with the service
Posted: Fri Feb 03, 2017 10:43 am
by tgriep
All of the settings in those files look good for that service and it looks like the current state of that check is OK.
The next time you see it go to a critical state and not recover, upload the same log files so we can check them when it is failing.
Re: Flap and Retain status issues with the service
Posted: Sun Feb 05, 2017 11:23 am
by dlukinski
tgriep wrote:All of the settings in those files look good for that service and it looks like the current state of that check is OK.
The next time you see it go to a critical state and not recover, upload the same log files so we can check them when it is failing.
Here you go: it is happening again.
Please look at the attached service screenshot. (The VM that runs it in the Firefox GUI runs it OK with no issues, and the Selenium script starts and ends OK.)
Re: Flap and Retain status issues with the service
Posted: Mon Feb 06, 2017 10:00 am
by tgriep
The screen shot is not attached to the post.
Can you attach it again, along with the log files, if the service check is failing to restart?
These are the files we need to see when the service doesn't run the check when it is down.
Code:
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Re: Flap and Retain status issues with the service
Posted: Sat Feb 11, 2017 1:36 pm
by dlukinski
tgriep wrote:The screen shot is not attached to the post.
Can you attach it again, along with the log files, if the service check is failing to restart?
These are the files we need to see when the service doesn't run the check when it is down.
Code:
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Re: Flap and Retain status issues with the service
Posted: Mon Feb 13, 2017 1:10 pm
by avandemore
Do the selenium checks generate a lot of output? What is the output of:
There is a known issue with Core and large outputs fixed in XI 5.4.x.
Re: Flap and Retain status issues with the service
Posted: Mon Feb 13, 2017 1:19 pm
by bwallace
This forum may be of use. Have we checked the Selenium Logs yet?
https://github.com/seleniumhq/selenium- ... ssues/2716
Re: Flap and Retain status issues with the service
Posted: Mon Feb 13, 2017 1:25 pm
by mcapra
If we could also see the outputs from those Selenium checks from the CLI, that might be helpful. I know the IDE's default timeout is pretty low which might be causing the false-negatives.
Re: Flap and Retain status issues with the service
Posted: Mon Feb 20, 2017 11:42 am
by dlukinski
mcapra wrote:If we could also see the outputs from those Selenium checks from the CLI, that might be helpful. I know the IDE's default timeout is pretty low which might be causing the false-negatives.
OK, because these issues come and go, I am going to wait until we upgrade to 5.4.2 and review again after that.
Here is one that just bounced back from the same exception:
Code:
[root@fikc-nagxiprod01 libexec]# perl selenium-alfresco-2-SGH
ok 1 - set_timeout, 60000
ok 2 - open, /production-ui/\#login
ok 3 - wait_for_page_to_load, 120000
ok 4
ok 5 - type, css=input.test-class-component-username, nagximon
ok 6 - type, css=input.test-class-component-password, C0mplexNagMon123
ok 7 - click, css=button.gwt-Button
ok 8
ok 9 - type, css=input.gwt-TextBox, 53202201
ok 10 - click, css=button.gwt-Button
ok 11
ok 12 - click, css=div.GCJ52QVBBI > span > input[type="checkbox"]
ok 13 - click, css=button.GCJ52QVBDL
Key: ALL, Value: 1487608866.80076:1
Start Time: 1487608866.80076
End Time: 1487608885.94119
OK: Processes completed after 19.14 of page loading.|0| ALL=19.14s;30;45;0;60 |1..13
[root@fikc-nagxiprod01 libexec]#
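For reference, the last line of that run is pipe-delimited (status text, return code, performance data, test plan). Below is a minimal Perl sketch of reading the performance data, assuming the usual Nagios perfdata order of label=value[UOM];warn;crit;min;max. The helper name parse_perfdata is hypothetical and is not part of check_selenium.

```perl
use strict;
use warnings;

# Hypothetical helper: split one perfdata item of the form
# label=value[UOM];warn;crit;min;max into named fields.
sub parse_perfdata {
    my ($item) = @_;
    $item =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    my ($label, $rest) = split /=/, $item, 2;
    my ($value_uom, $warn, $crit, $min, $max) = split /;/, $rest;
    # Separate the numeric value from its unit of measure, e.g. "19.14" and "s"
    my ($value, $uom) = $value_uom =~ /^(-?[\d.]+)(.*)$/;
    return {
        label => $label, value => $value, uom => $uom,
        warn  => $warn,  crit  => $crit,  min => $min, max => $max,
    };
}

# Status line copied from the run above
my $status_line = 'OK: Processes completed after 19.14 of page loading.|0| ALL=19.14s;30;45;0;60 |1..13';
my ($text, $code, $perf, $plan) = split /\|/, $status_line;
my $p = parse_perfdata($perf);
printf "%s took %s%s (warn %s, crit %s)\n", @{$p}{qw(label value uom warn crit)};
```

Read this way, the run took 19.14 seconds against a warning threshold of 30 and a critical threshold of 45, so this particular execution was within limits.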
Re: Flap and Retain status issues with the service
Posted: Mon Feb 20, 2017 4:09 pm
by mcapra
I would need to see the full call trace from the RC server to determine if it's a "false negative" or not. I know check_selenium by itself will throw a "CRITICAL" if it finds the word "ERROR" anywhere in the test case's output, though I believe the RC server will, in many cases, retry the test if it can't establish a session ID on the first attempt. So the first session on these tests might be failing, but a second or third may be succeeding. The logic in check_selenium I'm referring to:
Code:
# Run the test script and capture stdout and stderr together
my $output = `perl $script 2>&1`;
if ( $output =~ m/(ERROR.+\n)/ ) {
    # Any line containing "ERROR" makes the whole check CRITICAL
    $message = $1;
    $rc = 2; #end it with an error
} elsif ( $output =~ m/OK/ ) {
    my @lines = split(/\n/, $output);
    foreach my $line (@lines) {
        if ( $line =~ m/OK:/ ) {
            # Status line layout: message|return code|perfdata|items run
            ($message, $rc, $performance_msg, $items_ran) = split(/\|/, $line);
            $message = "$message | $performance_msg\n";
        }
    }
} else {
    $message = "UNKNOWN: $output";
    $rc = 3;
}
You might try altering this script to set $message equal to $output like so to get better debug information:
Though this could cause issues with the status output overflowing, as @avandemore pointed out.
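A minimal sketch of what such an alteration might look like, written as a standalone function so it can be tried outside the plugin. It is not the actual check_selenium change: the function name classify_output and the $MAX_DEBUG_CHARS cap are hypothetical, and the cap is only an arbitrary guard against the overflow concern, not a documented Nagios limit.

```perl
use strict;
use warnings;

# Hypothetical cap on the status text; an arbitrary value for illustration,
# not a limit taken from check_selenium or Nagios.
my $MAX_DEBUG_CHARS = 4096;

# Classify captured test output like the logic quoted above, but return the
# whole (truncated) output as the message when an ERROR line is present.
sub classify_output {
    my ($output) = @_;
    # Fallback in case "OK" appears but no "OK:" status line is found
    my ($message, $rc) = ("UNKNOWN: no status line found\n", 3);

    if ( $output =~ m/ERROR.+\n/ ) {
        # Debug variant: keep the full output instead of only the ERROR line
        $message = substr($output, 0, $MAX_DEBUG_CHARS);
        $rc = 2;
    } elsif ( $output =~ m/OK/ ) {
        foreach my $line ( split /\n/, $output ) {
            next unless $line =~ m/OK:/;
            my ($text, $code, $perf) = split /\|/, $line;
            $message = "$text | $perf\n";
            $rc = $code;
        }
    } else {
        $message = "UNKNOWN: " . substr($output, 0, $MAX_DEBUG_CHARS);
    }
    return ($message, $rc);
}

# Example with made-up output in which one step reports an error
my ($debug_message, $debug_rc) =
    classify_output("ok 1 - open, /\nERROR: session could not be started\nok 2\n");
print "rc=$debug_rc\n$debug_message";
```

With the full output in the message, the steps before and after the ERROR line show up in the service status, which should make it easier to tell a failed first session from a genuinely failed test.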