Sporadic 'Connection refused' errors in 4.2.4

kernow5000 · Post by **kernow5000** » Tue Jan 10, 2017 5:13 am

Yesterday I raised a few of the check timeouts.

I got a single failed->OK notification pair (email only) for one host, which shares a service check with two other hosts. The two other hosts checked out fine.

Same old connection refused error, same error 11s in syslog.

Jan 10 04:21:38 REDACTED nagios: job 836 (pid=3439): read() returned error 11
Jan 10 04:26:38 REDACTED nagios: job 841 (pid=5283): read() returned error 11

email alerts:

State: CRITICAL
Date/Time: Tue Jan 10 04:21:38 GMT 2017
Additional Info:
connect to address REDACTED and port 443: Connection refused

State: OK
Date/Time: Tue Jan 10 04:26:38 GMT 2017
Additional Info:
HTTP OK: HTTP/1.1 301 Moved Permanently - 472 bytes in 0.079 second response time

As you can see, these match up with the syslog entries.
As many developers have said, these error 11s are possibly just informational, but I'd love to get rid of these 'connection refused' false positives.

Shaun
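
For reference, errno 11 on Linux is EAGAIN ("Resource temporarily unavailable"): a non-blocking read that had no data ready at that moment. A minimal sketch to confirm the mapping on the Nagios host follows; `describe_errno` is a hypothetical helper name, and the numeric value is Linux-specific (other platforms number EAGAIN differently).

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 11 is the value from the syslog lines above; Linux maps it to EAGAIN.
    print(describe_errno(11))
```

This only decodes the number; it does not show why the worker's read() hit that condition.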

kernow5000 · Post by **kernow5000** » Tue Jan 10, 2017 11:39 am

Same host just got the same connection refused error - same error 11 in syslog too

check_http works fine from the command line, as does telnet. Funny how it was this one at 4AM this morning too.

Jan 10 16:31:38 backupserver nagios: job 1675 (pid=4218): read() returned error 11

from nagios.log
[1484022098] SERVICE ALERT: REDACTED;HTTPS check;CRITICAL;HARD;1;connect to address REDACTED and port 443: Connection refused
[1484022098] SERVICE NOTIFICATION: external;REDACTED;HTTPS check;CRITICAL;notify-service-by-email;connect to address REDACTED and port 443: Connection refused
[1484022398] SERVICE ALERT: REDACTED;HTTPS check;OK;HARD;1;HTTP OK: HTTP/1.1 301 Moved Permanently - 472 bytes in 0.079 second response time
[1484022398] SERVICE NOTIFICATION: external;REDACTED;HTTPS check;OK;notify-service-by-email;HTTP OK: HTTP/1.1 301 Moved Permanently - 472 bytes in 0.079 second response time
[1484065298] SERVICE ALERT: REDACTED;HTTPS check;CRITICAL;HARD;1;connect to address REDACTED and port 443: Connection refused
[1484065298] SERVICE NOTIFICATION: external;REDACTED;HTTPS check;CRITICAL;notify-service-by-email;connect to address REDACTED and port 443: Connection refused
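
The bracketed numbers in nagios.log are Unix epoch seconds, which makes them awkward to line up with syslog by eye. A rough sketch for converting SERVICE ALERT lines into readable UTC timestamps is below; `parse_service_alert` is a hypothetical helper, and the field order is taken from the lines quoted above.

```python
import re
from datetime import datetime, timezone

# Matches "[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output"
ALERT_RE = re.compile(r"^\[(\d+)\] SERVICE ALERT: (.*)$")


def parse_service_alert(line):
    """Return a dict for a SERVICE ALERT line, or None for any other line."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    # Plugin output may itself contain ';', so split at most five times.
    parts = match.group(2).split(";", 5)
    if len(parts) != 6:
        return None
    host, service, state, state_type, attempt, output = parts
    return {
        "time": datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc),
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "output": output,
    }


if __name__ == "__main__":
    sample = ("[1484022098] SERVICE ALERT: REDACTED;HTTPS check;CRITICAL;HARD;1;"
              "connect to address REDACTED and port 443: Connection refused")
    alert = parse_service_alert(sample)
    print(alert["time"].isoformat(), alert["state"], alert["output"])
```

Applied to the three SERVICE ALERT lines above, the timestamps come out as 04:21:38, 04:26:38 and 16:21:38 UTC on 10 January 2017, so the two CRITICAL entries are exactly twelve hours apart. That may be a coincidence, but it could be worth comparing with anything scheduled on the target host.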

kernow5000 · Post by **kernow5000** » Tue Jan 10, 2017 11:54 am

I might just remove the check for that host ... ha

dwhitfield · Post by **dwhitfield** » Tue Jan 10, 2017 12:10 pm

It looks like the way those messages are logged changed in 4.2.3, so it might just be a matter of changing your log level.

nagios: job XX (pid=YY): read() returned error 11 (changed from LOG_ERR to LOG_NOTICE)

https://github.com/NagiosEnterprises/na ... /Changelog

I'd be happy to do a bit more digging, but if removing the check is ok for you, that works for me too.

kernow5000 · Post by **kernow5000** » Tue Jan 10, 2017 12:12 pm

Now it's failed and gone to critical and eventually sent an SMS instead of an email for that host.

Host is fine and completely accessible.

connect to address REDACTED and port 443: Connection refused

Does it matter that I'm using host_name instead of host_address in host blocks?

I don't want this to turn into a nublet-nagios-config-101 thread as I think I can manage that by myself. But this one host ... bah!
Not to mention I don't know if the others are fixed now or just being rather quiet.

kernow5000 · Post by **kernow5000** » Tue Jan 10, 2017 12:24 pm

Info: CRITICAL - Socket timeout

Socket timeout, hmm - same host

Different host: connect to address REDACTED and port 25: Connection refused

I really don't understand how it's fine 99% of the time and then has these little blips. At least I know it's working, I guess.
Weird how it's always connection refused errors, when nothing changes on the host side and the platform is completely fine and operational.
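
One way to tell whether the blips come from the network path or from Nagios itself is to run an independent TCP probe from the Nagios server and compare its timestamps with the alerts. The following is a minimal sketch under that assumption; `probe` and `watch` are hypothetical names, and it only tests the TCP connect, not the HTTP exchange that check_http performs.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, port, timeout=5.0):
    """Attempt one TCP connect; return 'ok', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def watch(host, port, interval=10.0, count=6):
    """Probe repeatedly and print one timestamped result per attempt."""
    for _ in range(count):
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, host, port, probe(host, port))
        time.sleep(interval)


# Example (hypothetical host name): watch("monitored-host.example", 443)
```

If this probe also logs "refused" at the moments Nagios alerts, the refusals are real at the TCP level rather than something inside the Nagios workers.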

dwhitfield · Post by **dwhitfield** » Tue Jan 10, 2017 1:03 pm

What's the output of ulimit -a on the servers that are returning connection refused?

kernow5000 · Post by **kernow5000** » Wed Jan 11, 2017 4:18 am

Hi,

[ec2-user@redacted ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15734
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 15734
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

kernow5000 · Post by **kernow5000** » Wed Jan 11, 2017 10:48 am

Another - but this wasn't really a false positive as the platform had technically failed. Still.. wasn't expecting a connection refused.

Jan 11 15:33:03 REDACTED nagios: job 1555 (pid=25022): read() returned error 11
Jan 11 15:38:03 REDACTED nagios: job 1562 (pid=26781): read() returned error 11

***** Nagios *****
Notification Type: PROBLEM
Service: HTTPS check text
Host: REDACTED
Address: REDACTED
State: CRITICAL
Date/Time: Wed Jan 11 15:33:03 GMT 2017
Additional Info:
connect to address REDACTED and port 443: Connection refused

***** Nagios *****
Notification Type: RECOVERY
Service: HTTPS check text
Host: REDACTED
Address: REDACTED
State: OK
Date/Time: Wed Jan 11 15:38:03 GMT 2017
Additional Info:
HTTP OK: HTTP/1.1 200 OK - 253 bytes in 0.009 second response time

Weird

dwhitfield · Post by **dwhitfield** » Wed Jan 11, 2017 12:31 pm

FWIW, here's the block that looks different on mine.

open files                      (-n) 10000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240

The open files limit is the only one that really jumps out at me.

https://access.redhat.com/solutions/61334 should be of use.

Please let us know if you see any changes after increasing the limits.
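
Note that `ulimit -a` in an interactive shell reports that shell's limits, which are not necessarily the ones the running Nagios daemon inherited. On Linux the per-process values are exposed in /proc/<pid>/limits. A small sketch for reading the open-files limit from there is below; the function names are hypothetical, and the PID of the Nagios process has to be supplied by the caller.

```python
import resource


def open_file_limits(limits_text):
    """Extract (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Values are returned as strings because either may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return soft, hard
    return None


def read_process_limits(pid):
    """Read the open-files limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return open_file_limits(handle.read())


def current_process_limits():
    """Return (soft, hard) RLIMIT_NOFILE for the current Python process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


# Example: read_process_limits(1234)  # 1234 stands in for the Nagios PID
```

Checking the daemon's own values before and after raising the limit confirms that the change actually reached the process.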

Nagios Support Forum
