Flap and Retain status issues with the service

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
dlukinski
Posts: 1130
Joined: Tue Oct 06, 2015 9:42 am

Re: Flap and Retain status issues with the service

Post by dlukinski »

tgriep wrote:I was talking about the plugin changes, but if you want me to look into why that check works when you force the check, I will need to see the following files.

Code: Select all

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
Added zipped files
Last edited by tgriep on Fri Feb 03, 2017 10:44 am, edited 1 time in total.
Reason: Removed nagios.zip
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Post by tgriep »

All of the settings in those files look good for that service and it looks like the current state of that check is OK.
The next time you see it go to a critical state and not recover, upload the same log files so we can check them when it is failing.
Be sure to check out our Knowledgebase for helpful articles and solutions!
dlukinski
Posts: 1130
Joined: Tue Oct 06, 2015 9:42 am

Post by dlukinski »

tgriep wrote:All of the settings in those files look good for that service and it looks like the current state of that check is OK.
The next time you see it go to a critical state and not recover, upload the same log files so we can check them when it is failing.
Here you go: it is happening again.

Please look at the attached service screenshot. (The VM, which runs it in the GUI in Firefox, runs it OK with no issues. The Selenium script starts and ends OK.)
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Post by tgriep »

The screenshot is not attached to the post.
Can you attach it again, as well as the log files if the service check is failing to restart?
These are the files we need to see when the service doesn't run the check when it is down.

Code: Select all

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
dlukinski
Posts: 1130
Joined: Tue Oct 06, 2015 9:42 am

Post by dlukinski »

tgriep wrote:The screenshot is not attached to the post.
Can you attach it again, as well as the log files if the service check is failing to restart?
These are the files we need to see when the service doesn't run the check when it is down.

Code: Select all

/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/status.dat
You do not have the required permissions to view the files attached to this post.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Post by avandemore »

Do the selenium checks generate a lot of output? What is the output of:

Code: Select all

# /usr/local/nagios/bin/nagios -v
There is a known issue with Core and large outputs that was fixed in XI 5.4.x.
Previous Nagios employee
bwallace
Posts: 1145
Joined: Tue Nov 17, 2015 1:57 pm

Post by bwallace »

This forum may be of use. Have we checked the Selenium Logs yet?
https://github.com/seleniumhq/selenium- ... ssues/2716
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Post by mcapra »

If we could also see the output of those Selenium checks when run from the CLI, that might be helpful. I know the IDE's default timeout is pretty low, which might be causing the false negatives.
Former Nagios employee
https://www.mcapra.com/
dlukinski
Posts: 1130
Joined: Tue Oct 06, 2015 9:42 am

Post by dlukinski »

mcapra wrote:If we could also see the output of those Selenium checks when run from the CLI, that might be helpful. I know the IDE's default timeout is pretty low, which might be causing the false negatives.
OK, because these issues come and go, I am going to wait until we upgrade to 5.4.2 and review again after that.
Here is one that just bounced back from the same exception:

Code: Select all


[root@fikc-nagxiprod01 libexec]# perl selenium-alfresco-2-SGH
ok 1 - set_timeout, 60000
ok 2 - open, /production-ui/\#login
ok 3 - wait_for_page_to_load, 120000
ok 4
ok 5 - type, css=input.test-class-component-username, nagximon
ok 6 - type, css=input.test-class-component-password, C0mplexNagMon123
ok 7 - click, css=button.gwt-Button
ok 8
ok 9 - type, css=input.gwt-TextBox, 53202201
ok 10 - click, css=button.gwt-Button
ok 11
ok 12 - click, css=div.GCJ52QVBBI > span > input[type="checkbox"]
ok 13 - click, css=button.GCJ52QVBDL
Key: ALL, Value: 1487608866.80076:1
Start Time: 1487608866.80076
End Time: 1487608885.94119
OK: Processes completed after 19.14 of page loading.|0| ALL=19.14s;30;45;0;60  |1..13
[root@fikc-nagxiprod01 libexec]#
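
For readers following along: the last line of that output packs four `|`-separated fields (message, return code, perfdata, and the test plan range). Below is a minimal Python sketch of how that line breaks apart. It is an illustration only, not part of check_selenium; the function name `parse_status_line` is hypothetical, and it assumes the line always has exactly four fields with perfdata in the usual `value;warn;crit;min;max` order.

```python
def parse_status_line(line):
    """Split a 'message|rc|perfdata|items' status line into its parts."""
    # Assumes exactly four '|'-separated fields, as in the output above.
    message, rc, perfdata, items = line.split("|")
    label, _, values = perfdata.strip().partition("=")
    # Usual Nagios perfdata order: value[UOM];warn;crit;min;max
    value, warn, crit, minimum, maximum = values.split(";")
    return {
        "message": message.strip(),
        "rc": int(rc),
        "label": label,
        "value": value,  # kept as text because it carries the unit, e.g. "19.14s"
        "warn": float(warn),
        "crit": float(crit),
        "min": float(minimum),
        "max": float(maximum),
        "items": items.strip(),
    }


line = ("OK: Processes completed after 19.14 of page loading."
        "|0| ALL=19.14s;30;45;0;60  |1..13")
fields = parse_status_line(line)
print(fields["rc"], fields["value"], fields["warn"], fields["crit"])
```

Read this way, the run above took 19.14 seconds against a 30-second warning and 45-second critical threshold, so the script itself reported OK on this attempt.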

mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Post by mcapra »

I would need to see the full call trace from the RC server to determine whether it's a "false negative" or not. I know check_selenium by itself will throw a "CRITICAL" if it finds the word "ERROR" anywhere in the test case's output, though I believe that in many cases the RC server will retry the test if it can't establish a session ID on the first attempt. So the first session on these tests might be failing while a second or third succeeds. The logic in check_selenium I'm referring to:

Code: Select all

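# Run the test script and capture stdout and stderr together.
my $output = `perl $script 2>&1`;
if ( $output =~ m/(ERROR.+\n)/ ) {
    # An "ERROR" line anywhere in the output wins, even if an "OK:" line follows.
    $message = $1;
    $rc = 2; #end it with an error
} elsif ( $output =~ m/OK/ )  {
    my @lines = split(/\n/, $output);
    foreach my $line (@lines) {
        if ( $line =~ m/OK:/ ) {
            # Status line format: message|return code|perfdata|items ran
            ($message, $rc, $performance_msg, $items_ran) = split(/\|/, $line);
            $message = "$message | $performance_msg\n";
        }
    }
} else {
    # Neither "ERROR" nor "OK" was found, so report UNKNOWN with the raw output.
    $message = "UNKNOWN: $output";
    $rc = 3;
}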
You might try altering this script to set $message equal to $output like so to get better debug information:

Code: Select all

$message = $output;
Though this could cause issues with the status output overflowing as @avandemore pointed out.
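
To make both points concrete, here is a rough Python re-expression of the branch logic quoted above. It is a sketch for illustration, not the plugin itself: the names `classify` and `MAX_DEBUG_CHARS`, the 4096-character cap, and the sample "ERROR" text are all assumptions. It shows that an ERROR line from a failed first session forces CRITICAL even when a later retry prints an `OK:` line, and that capping the debug message is one way to limit the overflow risk.

```python
import re

MAX_DEBUG_CHARS = 4096  # hypothetical cap, not a value taken from Nagios


def classify(output, debug=False):
    """Mimic the ERROR / OK / UNKNOWN branches of the quoted Perl snippet."""
    error = re.search(r"(ERROR.+\n)", output)
    if error:
        # The ERROR branch is checked first, so later OK: lines are ignored.
        message, rc = error.group(1), 2
    elif "OK" in output:
        message, rc = None, None  # stays unset if no line contains "OK:"
        for line in output.split("\n"):
            if "OK:" in line:
                # Assumes four '|'-separated fields on the status line.
                text, rc_text, perf, _items = line.split("|")
                message, rc = f"{text} | {perf}\n", int(rc_text)
    else:
        message, rc = f"UNKNOWN: {output}", 3
    if debug:
        # Return the raw output for debugging, truncated to the cap.
        message = output[:MAX_DEBUG_CHARS]
    return message, rc


clean = ("ok 1 - set_timeout, 60000\n"
         "OK: Processes completed after 19.14 of page loading."
         "|0| ALL=19.14s;30;45;0;60  |1..13\n")
# Made-up ERROR line standing in for a failed first session attempt.
retried = "ERROR: sample failure on first session\n" + clean

print(classify(clean)[1])    # return code for the clean run
print(classify(retried)[1])  # return code when an ERROR line precedes the OK
```

If the retry theory holds, the second call is the case that would show up in XI as CRITICAL even though the same script exits cleanly when run by hand.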