check_nt intermittent time outs

Posted: Thu Oct 09, 2014 9:37 am
by MichielvM
Hi all,

We have one host which is monitored using several different service checks.
Among those are Windows services, which are checked using check_xi_service_nsclient (i.e. check_nt).
During an average day I get quite a few critical alerts reporting "CRITICAL - Socket timeout after 10 seconds", followed by an OK within one or two checks.
I've tinkered with the "-t" parameter, up to 60 seconds, to no avail.
I've restarted the nsclient service a few times; ditto.

I need to point out that this particular Windows host is the only one at this site. The other hosts are VMS servers.
We're in the process of migrating to Nagios. The old monitoring tool (OpManager) is not reporting any of this behavior, and inspection of the server shows nothing abnormal.
We use this host as a gateway between Nagios Eventlog and the VMS hosts. Those checks are OK all the time.

I should think that there is something the matter with Nagios itself or at least with the check_nt plugin.

Here is an extract of the services config. I haven't listed all of them, just these two:
STVMS4-Warning (100% OK)
Uptime (random timeouts)

Code:

define service {
        host_name                       staame-veeam01
        service_description             STVMS4-Warning
        use                             xiwizard_windowseventlog_service
        max_check_attempts              1
        check_interval                  1
        retry_interval                  1
        check_period                    24x7
        notification_interval           1
        notification_period             standbyuren
        notification_options            w,
        notifications_enabled           0
        contacts                        Team 3
        stalking_options                o,w,c,u,
        icon_image                      windowseventlog.png
        _xiwizard                       windowseventlog
        register                        1
        }

define service {
        host_name                       staame-veeam01
        service_description             Uptime
        use                             xiwizard_windowsserver_nsclient_service
        check_command                   check_xi_service_nsclient!!UPTIME!!!!!!
        max_check_attempts              5
        check_interval                  5
        retry_interval                  1
        check_period                    24x7
        notification_interval           60
        notification_period             standbyuren
        notifications_enabled           0
        contacts                        Team 1
        _xiwizard                       windowsserver
        register                        1
        }
I see nothing out of the ordinary.

XI is 2012R2.9
Core version is 3.5.0

Re: check_nt intermittent time outs

Posted: Thu Oct 09, 2014 4:53 pm
by sreinhardt
Did altering the timeout have any effect on your potentially invalid alerts? I can tell you that it's likely not something with the core engine, as they use our product. More than likely it is either an nsclient issue (which version are you running?) or possibly an issue with check_nt, which according to the nsclient developer is deprecated in favor of check_nrpe. Am I correct in understanding that this is an intermediary nsclient system, such that you are using it to check other systems without the Nagios system having to interact with them? What kind of time range are you seeing the issues in: all the time, or certain periods of the day? Do they seem to follow similar trends, such as alerting around the same times of day, or are they seemingly completely random? Is it always a timeout that alerts you to this issue?

Re: check_nt intermittent time outs

Posted: Mon Oct 13, 2014 9:11 am
by MichielvM
Hi Spenser,

Yes, this is an intermediary system. The invalid checks, however, are local, i.e. they monitor the intermediary itself.
The intermediary's role is simply to monitor SNMP traps, which are sent from the VMS servers and converted to events using KIWI.
That part is stable.

The behaviour is seemingly random.
Tinkering with the timeout (-t 10, 30, 60) has no effect.
I've downgraded nsclient to 0.3.9, no effect.

I took your advice about check_nrpe, and things look stable now. ;)
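
For anyone landing here with the same symptom, moving the Uptime service from check_nt to check_nrpe might look roughly like the sketch below. This is not the exact config used on the host above. It assumes the NRPE listener is enabled in NSClient++ on the Windows host, that CheckUpTime is the remote command being called, and that a wrapper command (here given the hypothetical name check_nrpe_simple) passes the remote command name as $ARG1$. XI normally ships its own check_nrpe command, so check what is already defined before adding another. The remaining directives are copied from the Uptime service posted earlier in the thread.

```
# Hypothetical wrapper command; the name check_nrpe_simple is an assumption.
# $ARG1$ is the name of the command to run on the NSClient++ side.
define command {
        command_name                    check_nrpe_simple
        command_line                    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$
        }

# Same Uptime service as before, with only check_command changed.
# Assumption: CheckUpTime is available via NRPE on the Windows host.
define service {
        host_name                       staame-veeam01
        service_description             Uptime
        use                             xiwizard_windowsserver_nsclient_service
        check_command                   check_nrpe_simple!CheckUpTime
        max_check_attempts              5
        check_interval                  5
        retry_interval                  1
        check_period                    24x7
        notification_interval           60
        notification_period             standbyuren
        notifications_enabled           0
        contacts                        Team 1
        register                        1
        }
```

Thresholds are left out on purpose: passing arguments over NRPE has to be allowed explicitly on the NSClient++ side, and the exact argument syntax depends on the nsclient version.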

Re: check_nt intermittent time outs

Posted: Mon Oct 13, 2014 9:41 am
by tmcdonald
Shall we keep the topic open or are you set to have it closed?

Re: check_nt intermittent time outs

Posted: Thu Oct 16, 2014 4:13 am
by MichielvM
As far as I am concerned, it can be closed. :)