Flap and Retain status issues with the service

Post by **dlukinski** » Fri Feb 03, 2017 9:49 am

tgriep wrote: I was talking about the plugin changes, but if you want me to look into why that check works when you force the check, I will need to see the following files.
Code: Select all
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat

Added zipped files

Post by **tgriep** » Fri Feb 03, 2017 10:43 am

All of the settings in those files look good for that service and it looks like the current state of that check is OK.
The next time you see it go to a critical state and not recover, upload the same log files so we can check them when it is failing.

Post by **dlukinski** » Sun Feb 05, 2017 11:23 am

tgriep wrote:All of the settings in those files look good for that service and it looks like the current state of that check is OK.
The next time you see it go to a critical state and not recover, upload the same log files so we can check them when it is failing.

Here you go: happening again

Please look at the attached service screenshot. (The VM that runs it in the Firefox GUI runs it OK with no issues; the Selenium script starts and ends OK.)

Post by **tgriep** » Mon Feb 06, 2017 10:00 am

The screen shot is not attached to the post.
Can you attach it again, as well as the log files, if the service check is failing to restart?
These are the files we need to see when the service doesn't run the check when it is down.

Code: Select all

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat

Post by **dlukinski** » Sat Feb 11, 2017 1:36 pm

tgriep wrote:The screen shot is not attached to the post.
Can you attach it again, as well as the log files, if the service check is failing to restart?
These are the files we need to see when the service doesn't run the check when it is down.
Code: Select all
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat

Post by **avandemore** » Mon Feb 13, 2017 1:10 pm

Do the selenium checks generate a lot of output? What is the output of:

Code: Select all

# /usr/local/nagios/bin/nagios -v

There is a known issue with Core and large outputs that was fixed in XI 5.4.x.

Post by **bwallace** » Mon Feb 13, 2017 1:19 pm

This forum may be of use. Have we checked the Selenium Logs yet?
https://github.com/seleniumhq/selenium- ... ssues/2716

Post by **mcapra** » Mon Feb 13, 2017 1:25 pm

If we could also see the outputs from those Selenium checks from the CLI, that might be helpful. I know the IDE's default timeout is pretty low which might be causing the false-negatives.

Post by **dlukinski** » Mon Feb 20, 2017 11:42 am

mcapra wrote:If we could also see the outputs from those Selenium checks from the CLI, that might be helpful. I know the IDE's default timeout is pretty low which might be causing the false-negatives.

OK, because these issues come and go, I am going to wait until we upgrade to 5.4.2 and review again after that.
Here is one that just bounced back from the same exception:

Code: Select all


[root@fikc-nagxiprod01 libexec]# perl selenium-alfresco-2-SGH
ok 1 - set_timeout, 60000
ok 2 - open, /production-ui/\#login
ok 3 - wait_for_page_to_load, 120000
ok 4
ok 5 - type, css=input.test-class-component-username, nagximon
ok 6 - type, css=input.test-class-component-password, C0mplexNagMon123
ok 7 - click, css=button.gwt-Button
ok 8
ok 9 - type, css=input.gwt-TextBox, 53202201
ok 10 - click, css=button.gwt-Button
ok 11
ok 12 - click, css=div.GCJ52QVBBI > span > input[type="checkbox"]
ok 13 - click, css=button.GCJ52QVBDL
Key: ALL, Value: 1487608866.80076:1
Start Time: 1487608866.80076
End Time: 1487608885.94119
OK: Processes completed after 19.14 of page loading.|0| ALL=19.14s;30;45;0;60  |1..13
[root@fikc-nagxiprod01 libexec]#

Post by **mcapra** » Mon Feb 20, 2017 4:09 pm

I would need to see the full call trace from the RC server to determine whether it's a "false negative" or not. I know check_selenium by itself will throw a "CRITICAL" if it finds the word "ERROR" anywhere in the test case's output, though I believe that in many cases the RC server will retry the test if it can't establish a session ID on the first attempt. So the first session on these tests might be failing, while a second or third might be succeeding. The logic in check_selenium I'm referring to:

Code: Select all

my $output = `perl $script 2>&1`;
if ( $output =~ m/(ERROR.+\n)/ ) {
    $message = $1;
    $rc = 2; #end it with an error
} elsif ( $output =~ m/OK/ )  {
    my @lines = split(/\n/, $output);
    foreach my $line (@lines) {
        if ( $line =~ m/OK:/ ) {
            ($message, $rc, $performance_msg, $items_ran) = split(/\|/, $line);
            $message = "$message | $performance_msg\n";
        }
    }
} else {
    $message = "UNKNOWN: $output";
    $rc = 3;
}

You might try altering this script to set $message equal to $output like so to get better debug information:

Code: Select all

$message = $output;

Though this could cause issues with the status output overflowing as @avandemore pointed out.
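
To illustrate why a retried session can still surface as CRITICAL, here is a minimal Python sketch that mirrors the branching of the Perl excerpt above. It is not part of check_selenium: the function name `classify`, the `debug` flag, the `MAX_DEBUG_CHARS` cap, and the sample ERROR line are all hypothetical, and the fallback to UNKNOWN when no `OK:` summary line is present is an assumption (the Perl excerpt leaves `$message` and `$rc` untouched in that case).

```python
import re

MAX_DEBUG_CHARS = 4096  # hypothetical cap, not a documented Nagios limit

# Summary line copied from the CLI run posted earlier in the thread.
SUMMARY = "OK: Processes completed after 19.14 of page loading.|0| ALL=19.14s;30;45;0;60  |1..13\n"


def classify(output, debug=False):
    """Return (return_code, message), following the Perl excerpt's branches."""
    error = re.search(r"(ERROR.+\n)", output)
    if error:
        # Any ERROR line wins, even when a later retry succeeded.
        rc, message = 2, error.group(1)
    elif "OK" in output:
        # Assumption: fall back to UNKNOWN if no "OK:" summary line exists.
        rc, message = 3, "UNKNOWN: no OK: summary line found\n"
        for line in output.split("\n"):
            fields = line.split("|")
            if "OK:" in line and len(fields) >= 3:
                rc = int(fields[1])
                message = f"{fields[0]} | {fields[2]}\n"
    else:
        rc, message = 3, f"UNKNOWN: {output}"
    if debug:
        # Return the raw output for debugging, truncated to the cap.
        message = output[:MAX_DEBUG_CHARS]
    return rc, message


clean_run = "ok 1 - set_timeout, 60000\n" + SUMMARY
# The ERROR line below is invented for illustration.
retried_run = "ERROR: first session attempt failed\n" + clean_run

print(classify(clean_run))
print(classify(retried_run))
```

Because the ERROR branch is tested first, a run whose first session attempt logs an ERROR line is reported as CRITICAL even if a later attempt ends with the `OK:` summary. Capping the debug message is one way to limit the overflow risk mentioned above; the value 4096 is arbitrary.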

Nagios Support Forum
