[solved] Socket timeout madness

Posted: Sat Sep 15, 2018 4:24 pm
by KalimAlRazif
Hi,
My Nagios 4.2.1 reports socket timeouts on certain hosts, for both NRPE and SSH checks. I ran tcpdump on the Nagios server and on one of the affected remote servers: when Nagios attempts the check, no traffic is detected, but when I run the check manually from a shell, tcpdump on both machines sees the traffic and the check works perfectly.

This behavior is new; until today the Nagios instance was working perfectly. I deleted the logs and retention files and rebooted both the Nagios container (LXC) and the remote container. Some of the containers are on another machine and some are on the same physical machine.

Please, any ideas? :?

Thanks in advance
Nomar

Re: Socket timeout madness

Posted: Mon Sep 17, 2018 11:37 am
by cdienger
Are the checks configured to use the IP address or the hostname of the destination? How exactly are you running the plugins and tcpdump on the command line? If tcpdump is capturing only a specific port or IP address and the configured checks fail due to a DNS issue, this could produce the behavior you're seeing.

Re: Socket timeout madness

Posted: Tue Sep 18, 2018 7:43 am
by KalimAlRazif
cdienger wrote: Are the checks configured to use the IP address or the hostname of the destination? How exactly are you running the plugins and tcpdump on the command line? If tcpdump is capturing only a specific port or IP address and the configured checks fail due to a DNS issue, this could produce the behavior you're seeing.
Hi,
Indeed, all of the affected hosts were defined by name. I changed them so they are now defined by IP, but it made no difference :cry:

Re: Socket timeout madness

Posted: Tue Sep 18, 2018 10:21 am
by cdienger
And the nagios service was restarted after making these changes, correct?

What options are you running tcpdump with? I would update it to capture port 53 traffic and also traffic for the IP addresses that the hostnames were pointing at.

Can you share the config files that include the check and command definitions?
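
As a rough sketch of the broader capture described above (not the exact command used in this thread), a filter that matches DNS lookups as well as traffic to or from one monitored host could be built like this. The `build_filter` helper and the address `192.0.2.10` are placeholders for illustration, and `-i any` assumes a Linux host.

```shell
#!/bin/sh
# Hypothetical helper: build a tcpdump filter that matches DNS traffic
# (port 53) plus any traffic to or from the given host IP.
build_filter() {
    printf 'port 53 or host %s' "$1"
}

# Print the filter for a placeholder address.
build_filter 192.0.2.10
echo

# Example capture (needs root), left commented out:
# tcpdump -n -i any "$(build_filter 192.0.2.10)"
```

Because the filter is not limited to one direction or one port, a failed name lookup would show up in the capture even if the check never reaches the remote host.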

Re: Socket timeout madness

Posted: Wed Sep 19, 2018 9:26 am
by KalimAlRazif
cdienger wrote: And the nagios service was restarted after making these changes, correct?

What options are you running tcpdump with? I would update it to capture port 53 traffic and also traffic for the IP addresses that the hostnames were pointing at.

Can you share the config files that include the check and command definitions?
Yes, the service was restarted.

On the Nagios host:

Code: Select all

tcpdump -n dst host remote_host_ip -vv
On the remote host:

Code: Select all

tcpdump -n src host nagios_host_ip -vv
These two commands are the ones producing errors:

Code: Select all

define command {
                command_name                          check_nrpe
                command_line                          $USER1$/check_nrpe -H '$HOSTADDRESS$' -c '$ARG1$' -t 30:3
}
define command {
                command_name                          check_ssh
                command_line                          $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
}
SSH check config:

Code: Select all

define service {
                service_description                   ssh
                check_command                         check_ssh!
                check_period                          24x7
                notification_period                   24x7
                host_name                             list of hosts separated by comma
                servicegroups                         ssh
                contact_groups                        +admins,jefecg
                use                                   generic-service
}
Example NRPE check:

Code: Select all

define service {
                service_description                   load
                check_command                         check_nrpe!check_load
                check_period                          24x7
                notification_period                   24x7
                host_name                             list of hosts separated by comma
                contact_groups                        +admins,jefecg
                use                                   generic-service
}
But let me make some changes to the host names: the addresses are already IP addresses, but the host names are still the FQDNs of the hosts.
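
For reference, a minimal sketch of how the macros in the `check_ssh` command definition above would expand when the service uses `check_ssh!` with an empty `$ARG1$`. The `expand_command_line` helper is hypothetical, and the plugin directory `/usr/local/nagios/libexec` and address `192.0.2.10` are assumed placeholder values, not taken from this setup.

```shell
#!/bin/sh
# Hypothetical helper: expand the Nagios macros used in a command_line by hand.
# Arguments: template, $USER1$ value, $HOSTADDRESS$ value, $ARG1$ value.
# Limitation: values must not contain '|', '&' or backslashes (sed syntax).
expand_command_line() {
    printf '%s\n' "$1" | sed \
        -e "s|\\\$USER1\\\$|$2|g" \
        -e "s|\\\$HOSTADDRESS\\\$|$3|g" \
        -e "s|\\\$ARG1\\\$|$4|g"
}

# check_ssh! passes no argument, so $ARG1$ is empty here.
expand_command_line '$USER1$/check_ssh $ARG1$ $HOSTADDRESS$' \
    /usr/local/nagios/libexec 192.0.2.10 ''
```

The expanded line can then be compared with the command typed by hand, to confirm that both runs invoke the plugin with the same arguments.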

Re: Socket timeout madness

Posted: Wed Sep 19, 2018 9:38 am
by KalimAlRazif
No luck :-( Using only IP addresses in the configs did not work :-(

Re: Socket timeout madness

Posted: Thu Sep 20, 2018 4:18 pm
by npolovenko
@KalimAlRazif, can you run the nmap command against the host's IP address from the Nagios server and show me the output?

Re: Socket timeout madness

Posted: Mon Sep 24, 2018 1:09 pm
by KalimAlRazif
The output of the command, executed from the Nagios host:

Code: Select all

root@nagios:~# nmap -P0 ip_of_one_of_the_failed_hosts

Starting Nmap 6.00 ( http://nmap.org ) at 2018-09-24 14:04 EDT
Nmap scan report for ipXXX.ip-XX-XX-XX.net (ip_of_one_of_the_failed_hosts)
Host is up (0.090s latency).
Not shown: 996 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
5666/tcp open  nrpe
9101/tcp open  jetdirect
9103/tcp open  jetdirect

Nmap done: 1 IP address (1 host up) scanned in 1.57 seconds

Code: Select all

root@nagios-new:~# nmap -P0 another_host_with_error

Starting Nmap 6.00 ( http://nmap.org ) at 2018-09-24 14:07 EDT
Nmap scan report for another_host_with_error (another_host_with_error)
Host is up (0.0094s latency).
Not shown: 996 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
25/tcp   open  smtp
5666/tcp open  nrpe
9102/tcp open  jetdirect

Nmap done: 1 IP address (1 host up) scanned in 0.31 seconds

Re: Socket timeout madness

Posted: Tue Sep 25, 2018 12:07 pm
by npolovenko
@KalimAlRazif, you said:
if nagios attempts the check no traffic is detected but if I perform the check via normal shell the TCP dump on both machines detect traffic and the check works perfectly.
Can you actually run the command from the command line and show me the output? And then take a screenshot of the command failing in the web interface?
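
When running a plugin by hand like this, it can help to record its exit status along with the output, since Nagios derives the service state from that status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A small sketch, with `state_name` and `run_check` as hypothetical helper names:

```shell
#!/bin/sh
# Map a plugin exit status to the corresponding Nagios state name.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "out of range" ;;
    esac
}

# Run a plugin command, let its output through, then print the exit status.
run_check() {
    "$@"
    status=$?
    echo "exit status: $status ($(state_name "$status"))"
    return "$status"
}

# Example, left commented out because it needs the plugin and a reachable host:
# run_check ./check_ssh -H remote_ip
```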

Re: Socket timeout madness

Posted: Wed Sep 26, 2018 3:30 pm
by KalimAlRazif
Sure:


For example, ssh is "failing":

Code: Select all

./check_ssh -H remote_ip
SSH OK - OpenSSH_7.2p2 Ubuntu-4ubuntu2.2 (protocol 2.0) | time=0.186555s;;;0.000000;10.000000