[solved] Socket timeout madness
Posted: Sat Sep 15, 2018 4:24 pm
by KalimAlRazif
Hi,
My Nagios 4.2.1 reports socket timeouts on certain hosts, for both NRPE and SSH checks. I ran tcpdump on the Nagios server and on one of the affected remote servers: when Nagios attempts the check, no traffic is detected, but when I run the check manually from a shell, tcpdump on both machines shows traffic and the check works perfectly.
This behavior is new; until today the Nagios instance was working perfectly. I deleted the logs and retention files and rebooted the Nagios container (LXC) and the remote container too. Some of the containers are on another machine and some are on the same physical machine.
Please, any ideas?
Thanks in advance
Nomar
Re: Socket timeout madness
Posted: Mon Sep 17, 2018 11:37 am
by cdienger
Are the checks configured to use the IP address or the hostname of the destination? How exactly are you running the plugins and tcpdump on the command line? If tcpdump is capturing only a specific port or IP address and the configured checks fail due to a DNS issue, this could produce the behavior you're seeing.
Re: Socket timeout madness
Posted: Tue Sep 18, 2018 7:43 am
by KalimAlRazif
Hi,
Indeed, all of the affected hosts were defined by name. I changed them and they are now defined by IP, but it made no difference.

Re: Socket timeout madness
Posted: Tue Sep 18, 2018 10:21 am
by cdienger
And the nagios service was restarted after making these changes, correct?
What options are you running tcpdump with? I would update it to capture port 53 traffic and also the IP addresses that the host names were pointing at.
Can you share the config files that include the check and command config?
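A minimal sketch of the wider capture suggested above. It assumes the checks use the default SSH and NRPE ports (22 and 5666); `build_filter` is a hypothetical helper name and `192.0.2.10` is a placeholder address. Unlike a `dst host` or `src host` filter, this matches traffic in both directions and also any DNS lookups:

```shell
#!/bin/sh
# Build a tcpdump filter that matches SSH/NRPE traffic to or from one remote
# host (either direction), plus all DNS traffic on port 53.
# Assumption: default ports 22 (ssh) and 5666 (nrpe).
build_filter() {
    remote_ip="$1"
    printf '(host %s and (tcp port 22 or tcp port 5666)) or port 53\n' "$remote_ip"
}

# Print the command instead of running it, since tcpdump needs root.
# 192.0.2.10 is a placeholder; substitute the affected host's address.
echo "tcpdump -n -vv '$(build_filter 192.0.2.10)'"
```

If the scheduled checks trigger a lookup that never gets an answer, it should show up here as port 53 traffic with no matching connection to the remote host.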
Re: Socket timeout madness
Posted: Wed Sep 19, 2018 9:26 am
by KalimAlRazif
Yes, the service was restarted.
On the Nagios host:
Code: Select all
tcpdump -n dst host remote_host_ip -vv
On the remote host:
Code: Select all
tcpdump -n src host nagios_host_ip -vv
These two commands are producing errors:
Code: Select all
define command {
command_name check_nrpe
command_line $USER1$/check_nrpe -H '$HOSTADDRESS$' -c '$ARG1$' -t 30:3
}
define command {
command_name check_ssh
command_line $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
}
SSH check config:
Code: Select all
define service {
service_description ssh
check_command check_ssh!
check_period 24x7
notification_period 24x7
host_name list of hosts separated by comma
servicegroups ssh
contact_groups +admins,jefecg
use generic-service
}
Example NRPE check:
Code: Select all
define service {
service_description load
check_command check_nrpe!check_load
check_period 24x7
notification_period 24x7
host_name list of hosts separated by comma
contact_groups +admins,jefecg
use generic-service
}
But let me make some changes to the host names: the addresses are already IP addresses, but the host names are still the FQDNs of the hosts.
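To compare a manual run with what Nagios actually executes, it can help to expand the macros in `command_line` by hand. The following is only a rough sketch of that substitution, not how Nagios itself does it: `expand_command` is a hypothetical helper, `/usr/local/nagios/libexec` is an assumed value for `$USER1$`, and `192.0.2.10` is a placeholder address. Values containing `|` or `&` would need escaping for `sed`.

```shell
#!/bin/sh
# expand_command TEMPLATE USER1 HOSTADDRESS ARG1
# Replace the three macros used in the command definitions above with
# literal values and print the resulting command line.
expand_command() {
    printf '%s\n' "$1" | sed \
        -e "s|\\\$USER1\\\$|$2|g" \
        -e "s|\\\$HOSTADDRESS\\\$|$3|g" \
        -e "s|\\\$ARG1\\\$|$4|g"
}

# Assumed plugin directory and placeholder address (neither is from the thread).
plugins=/usr/local/nagios/libexec
addr=192.0.2.10

# check_nrpe!check_load  ->  ARG1 is "check_load"
expand_command "\$USER1\$/check_nrpe -H '\$HOSTADDRESS\$' -c '\$ARG1\$' -t 30:3" \
    "$plugins" "$addr" check_load

# check_ssh!  ->  ARG1 is empty, so the address is passed as a positional argument
expand_command "\$USER1\$/check_ssh \$ARG1\$ \$HOSTADDRESS\$" \
    "$plugins" "$addr" ""
```

Running the printed lines exactly as shown, rather than a hand-typed variant, keeps the manual test as close as possible to the scheduled check.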
Re: Socket timeout madness
Posted: Wed Sep 19, 2018 9:38 am
by KalimAlRazif
No luck: using only IP addresses in the configs did not work.

Re: Socket timeout madness
Posted: Thu Sep 20, 2018 4:18 pm
by npolovenko
@KalimAlRazif, can you run the nmap command with the host's IP address from the Nagios server and show me the output?
Re: Socket timeout madness
Posted: Mon Sep 24, 2018 1:09 pm
by KalimAlRazif
Here is the output of the command, executed from the Nagios host:
Code: Select all
root@nagios:~# nmap -P0 ip_of_one_of_the_failed_hosts
Starting Nmap 6.00 ( http://nmap.org ) at 2018-09-24 14:04 EDT
Nmap scan report for ipXXX.ip-XX-XX-XX.net (ip_of_one_of_the_failed_hosts)
Host is up (0.090s latency).
Not shown: 996 closed ports
PORT STATE SERVICE
22/tcp open ssh
5666/tcp open nrpe
9101/tcp open jetdirect
9103/tcp open jetdirect
Nmap done: 1 IP address (1 host up) scanned in 1.57 seconds
Code: Select all
root@nagios-new:~# nmap -P0 another_host_with_error
Starting Nmap 6.00 ( http://nmap.org ) at 2018-09-24 14:07 EDT
Nmap scan report for another_host_with_error (another_host_with_error)
Host is up (0.0094s latency).
Not shown: 996 closed ports
PORT STATE SERVICE
22/tcp open ssh
25/tcp open smtp
5666/tcp open nrpe
9102/tcp open jetdirect
Nmap done: 1 IP address (1 host up) scanned in 0.31 seconds
Re: Socket timeout madness
Posted: Tue Sep 25, 2018 12:07 pm
by npolovenko
@KalimAlRazif, You said
if nagios attempts the check no traffic is detected but if I perform the check via normal shell the TCP dump on both machines detect traffic and the check works perfectly.
Can you actually run the command from the command line and show me the output? And then take a screenshot of the command failing in the web interface?
Re: Socket timeout madness
Posted: Wed Sep 26, 2018 3:30 pm
by KalimAlRazif
Sure:
For example, SSH is "failing":
Code: Select all
./check_ssh -H remote_ip
SSH OK - OpenSSH_7.2p2 Ubuntu-4ubuntu2.2 (protocol 2.0) | time=0.186555s;;;0.000000;10.000000