Re: How do I not alert on Socket timeout after 30 seconds.

Posted: Thu Jul 25, 2019 12:07 pm
by tgriep
I am not seeing an email Notification for a socket timeout.
The Notification is for the CPU Usage being over 95%.

There is an Alert for the timeout, but no Notification was sent for it.
Alerts and Notifications are two different things. See this video:
https://www.youtube.com/watch?v=EDmZ6NtCH7s

About the Notification after 1.
Did you upgrade the version of XI on the server?
There are some bugs in Nagios Core that could cause this issue.
FYI, a Core maintenance release came out yesterday that could fix it.
Do you want to try that?

Another cause could be a duplicate nagios process.
Run the following as root to ensure that is not happening.

Code:

service nagios stop
killall -9 nagios
service nagios start
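
To confirm whether anything is actually left behind, here is a minimal sketch, assuming a Linux host with /proc mounted. The function name leftover_procs is made up for illustration. Run it between the stop and the start: after "service nagios stop", any PID it prints belongs to a process that did not exit. A healthy running Nagios usually shows several nagios processes (the daemon plus its workers), so the list only means something while the service is stopped.

```shell
# List PIDs of processes whose command name is exactly "$1" (default: nagios).
# Hypothetical helper; reads /proc directly, so it is Linux-only.
leftover_procs() {
    name="${1:-nagios}"
    for f in /proc/[0-9]*/comm; do
        # Skip entries that vanished or are unreadable.
        [ -r "$f" ] || continue
        if [ "$(cat "$f" 2>/dev/null)" = "$name" ]; then
            pid="${f#/proc/}"      # strip the leading "/proc/"
            echo "${pid%/comm}"    # strip the trailing "/comm"
        fi
    done
}
```

Usage: run "service nagios stop", then "leftover_procs nagios". Empty output means no stray process survived the stop.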

Re: How do I not alert on Socket timeout after 30 seconds.

Posted: Mon Jul 29, 2019 4:41 pm
by gormank
I'm not sure if it helps, but I run into this occasionally...
I set the check_nrpe commands so that a timeout returns UNKNOWN rather than CRITICAL, and I don't notify on UNKNOWN.
check_nrpe -t n:m ...
n is the number of seconds before the check times out
m is the state to return on timeout (0-3; 3 is UNKNOWN)

I think Nagios also has a global timeout for checks, so if that's shorter than the check_nrpe timeout, it will be used instead.
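
As a rough illustration of the n:m form, the sketch below builds the value passed to -t and checks that the state is one of the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The helper names state_name and timeout_arg, the address 192.0.2.10, and the check_load command are placeholders, not anything from this thread.

```shell
# Map a Nagios plugin exit code to its state name.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "invalid state: $1" >&2; return 1 ;;
    esac
}

# Build the value for check_nrpe's -t option as <seconds>:<state>.
# Rejects states outside 0-3.
timeout_arg() {
    state_name "$2" >/dev/null || return 1
    printf '%s:%s\n' "$1" "$2"
}

# Example: a 30-second timeout that exits UNKNOWN (3) instead of CRITICAL.
# The host address and remote command are placeholders.
echo "check_nrpe -H 192.0.2.10 -t $(timeout_arg 30 3) -c check_load"
# prints: check_nrpe -H 192.0.2.10 -t 30:3 -c check_load
```

The n:m syntax depends on the check_nrpe version installed, so check "check_nrpe -h" on your server before relying on it.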

Re: How do I not alert on Socket timeout after 30 seconds.

Posted: Mon Jul 29, 2019 4:48 pm
by lmiltchev
With NRPE v3, you could use the "-u" flag to return UNKNOWN instead of CRITICAL on connection issues.
-u, --unknown-timeout Make connection problems return UNKNOWN instead of CRITICAL

Re: How do I not alert on Socket timeout after 30 seconds.

Posted: Tue Aug 06, 2019 7:59 am
by mkeey
My apologies for the excessive delay. A lot going on at work.
I like the suggestion of returning "unknown" with the "-u" option. I'm not sure where this should be configured, however. Can you provide some instructions?

Re: How do I not alert on Socket timeout after 30 seconds.

Posted: Tue Aug 06, 2019 8:33 am
by lmiltchev
You can add the "-u" flag to the check_nrpe command.

CCM > Commands > check_nrpe > Edit > Save > Apply Configuration

Example:

Code:

define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -u -H $HOSTADDRESS$ -t 60 -c $ARG1$ $ARG2$
}

Re: How do I not alert on Socket timeout after 30 seconds.

Posted: Wed Aug 07, 2019 2:51 pm
by mkeey
Excellent! Thank you. Discussed with the team and manager. We'll be giving this a go on our Test and UAT systems. Please leave case open until I try this for a while.

Re: How do I not alert on Socket timeout after 30 seconds.

Posted: Wed Aug 07, 2019 3:28 pm
by lmiltchev
Sure. We will keep the topic open for now.

Re: How do I not alert on Socket timeout after 30 seconds.

Posted: Tue Aug 13, 2019 10:00 am
by mkeey
Added "-u" to the check_nrpe command via the XI CCM screen. Almost immediately, checks started going to an "Unknown" status, but only for our Linux servers. Windows still issued the Critical for socket timeouts and connection refused message.


SERVER01 (Windows)
CPU Usage
Critical 24d 23h 26m 23s 20/20 08/13/2019 10:44:19 CRITICAL - Socket timeout
Drive C: Disk Usage
Critical 24d 23h 26m 28s 05/05 08/13/2019 10:45:56 CRITICAL - Socket timeout
Memory Usage
Critical 24d 23h 25m 39s 24/24 08/13/2019 10:46:16 CRITICAL - Socket timeout


SERVER03 (Linux Cloud)
/ Disk Usage
Unknown 04d 22h 05m 57s 05/05 08/13/2019 10:45:21 (No output on stdout) stderr: connect to address 10.225.73.14 port 5666: Connection refused
/home Disk Usage
Unknown 04d 22h 06m 19s 05/05 08/13/2019 10:44:53 (No output on stdout) stderr: connect to address 10.225.73.14 port 5666: Connection refused
/opt Disk Usage
Unknown 04d 22h 10m 07s 05/05 08/13/2019 10:46:19 (No output on stdout) stderr: connect to address 10.225.73.14 port 5666: Connection refused
/tmp Disk Usage
Unknown 04d 22h 09m 48s 05/05 08/13/2019 10:46:28 (No output on stdout) stderr: connect to address 10.225.73.14 port 5666: Connection refused
/var Disk Usage
Unknown 04d 22h 07m 47s 05/05 08/13/2019 10:48:41 (No output on stdout) stderr: connect to address 10.225.73.14 port 5666: Connection refused
/var/log Disk Usage
Unknown 04d 22h 07m 14s 05/05 08/13/2019 10:49:13 (No output on stdout) stderr: connect to address 10.225.73.14 port 5666: Connection refused
CPU Stats
Unknown 04d 22h 06m 43s 20/20 08/13/2019 10:44:49 (No output on stdout) stderr: connect to address 10.225.73.14 port 5666: Connection refused
Memory Usage
Unknown 04d 22h 02m 39s 24/24 08/13/2019 10:40:15 (No output on stdout) stderr: connect to address 10.225.73.14 port 5666: Connection refused


SERVER04 (Windows)
CPU Usage
Critical 11d 00h 56m 44s 20/20 08/13/2019 10:45:26 connect to address 10.13.7.8 and port 12489: Connection refused
Drive C: Disk Usage
Critical 11d 00h 59m 11s 05/05 08/13/2019 10:47:07 connect to address 10.13.7.8 and port 12489: Connection refused
Memory Usage
Critical 11d 00h 55m 33s 24/24 08/13/2019 10:44:25 connect to address 10.13.7.8 and port 12489: Connection refused


SERVER05 (Linux OnPrem)
/ Disk Usage
Unknown 04d 22h 8m 29s 05/05 08/13/2019 10:48:18 (No output on stdout) stderr: connect to address 10.186.7.156 port 5666: Connection refused
/home Disk Usage
Unknown 04d 22h 10m 25s 05/05 08/13/2019 10:46:25 (No output on stdout) stderr: connect to address 10.186.7.156 port 5666: Connection refused
/opt Disk Usage
Unknown 04d 22h 9m 30s 05/05 08/13/2019 10:47:17 (No output on stdout) stderr: connect to address 10.186.7.156 port 5666: Connection refused
/srv/bit9 Disk Usage
Unknown 04d 22h 8m 29s 05/05 08/13/2019 10:48:21 (No output on stdout) stderr: connect to address 10.186.7.156 port 5666: Connection refused
/tmp Disk Usage
Unknown 04d 22h 7m 53s 05/05 08/13/2019 10:48:50 (No output on stdout) stderr: connect to address 10.186.7.156 port 5666: Connection refused
/var Disk Usage
Unknown 04d 22h 6m 37s 05/05 08/13/2019 10:45:23 (No output on stdout) stderr: connect to address 10.186.7.156 port 5666: Connection refused
/var/log Disk Usage
Unknown 04d 22h 5m 47s 05/05 08/13/2019 10:46:10 (No output on stdout) stderr: connect to address 10.186.7.156 port 5666: Connection refused
CPU Stats
Unknown 04d 22h 8m 25s 20/20 08/13/2019 10:48:30 (No output on stdout) stderr: connect to address 10.186.7.156 port 5666: Connection refused
Memory Usage
Unknown 04d 22h 2m 17s 24/24 08/13/2019 10:40:49 (No output on stdout) stderr: connect to address 10.186.7.156 port 5666: Connection refused


SERVER06 (Linux OnPrem)
/ Disk Usage
Unknown 04d 22h 10m 29s 05/05 08/13/2019 10:46:15 (No output on stdout) stderr: connect to address 10.186.7.158 port 5666: Connection refused
/home Disk Usage
Unknown 04d 22h 08m 54s 05/05 08/13/2019 10:48:03 (No output on stdout) stderr: connect to address 10.186.7.158 port 5666: Connection refused
/opt Disk Usage
Unknown 04d 22h 08m 15s 05/05 08/13/2019 10:48:37 (No output on stdout) stderr: connect to address 10.186.7.158 port 5666: Connection refused
/srv/bit9 Disk Usage
Unknown 04d 22h 09m 24s 05/05 08/13/2019 10:47:28 (No output on stdout) stderr: connect to address 10.186.7.158 port 5666: Connection refused
/tmp Disk Usage
Unknown 04d 22h 08m 35s 05/05 08/13/2019 10:48:13 (No output on stdout) stderr: connect to address 10.186.7.158 port 5666: Connection refused
/var Disk Usage
Unknown 04d 22h 07m 34s 05/05 08/13/2019 10:49:15 (No output on stdout) stderr: connect to address 10.186.7.158 port 5666: Connection refused
/var/log Disk Usage
Unknown 04d 22h 06m 54s 05/05 08/13/2019 10:44:58 (No output on stdout) stderr: connect to address 10.186.7.158 port 5666: Connection refused
CPU Stats
Unknown 04d 22h 05m 41s 20/20 08/13/2019 10:46:23 (No output on stdout) stderr: connect to address 10.186.7.158 port 5666: Connection refused
Memory Usage
Unknown 04d 22h 04m 3s 24/24 08/13/2019 10:48:57 (No output on stdout) stderr: connect to address 10.186.7.158 port 5666: Connection refused

Re: How do I not alert on Socket timeout after 30 seconds.

Posted: Tue Aug 13, 2019 10:56 am
by lmiltchev
mkeey wrote: "Windows still issued the Critical for socket timeouts and connection refused message."
This is because these checks use check_nt, not check_nrpe. check_nt has the same option, but it is deprecated and doesn't work.
Usage:
check_nt -H host -v variable [-p port] [-w warning] [-c critical]
[-l params] [-d SHOWALL] [-u](DEPRECATED) [-t timeout]
You may need to switch to using check_nrpe instead of check_nt. See the NSClient++ documentation here:

https://docs.nsclient.org/howto/nrpe/
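
If switching to check_nrpe cannot happen right away, one possible stopgap is a small wrapper that runs the existing plugin and turns CRITICAL into UNKNOWN only when the output looks like a connection problem. This was not suggested in the thread and is only a sketch. The function name remap_conn_errors is made up, and the two matched strings are taken from the status output posted above.

```shell
# Run a plugin command. If it exits CRITICAL (2) and its output looks like a
# connection problem, return UNKNOWN (3) instead; otherwise pass the exit
# code through unchanged. The plugin's output is always printed as-is.
remap_conn_errors() {
    # Capture stdout and stderr together, and the plugin's exit status.
    output="$("$@" 2>&1)" && status=0 || status=$?
    if [ "$status" -eq 2 ]; then
        case "$output" in
            *"Socket timeout"*|*"Connection refused"*) status=3 ;;
        esac
    fi
    printf '%s\n' "$output"
    return "$status"
}
```

Usage, in a wrapper script that the Nagios command definition calls in place of the plugin: remap_conn_errors /path/to/check_nt -H host -v variable ... followed by exiting with its return code. A genuine CRITICAL result (for example, CPU over threshold) still comes through as CRITICAL.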