[solved] Socket timeout madness

KalimAlRazif
Posts: 7
Joined: Sat Sep 15, 2018 4:09 pm

[solved] Socket timeout madness

Post by KalimAlRazif »

Hi,
My Nagios 4.2.1 reports socket timeouts on certain hosts, for both NRPE and SSH checks. I ran tcpdump on the Nagios server and on one of the affected remote servers: when Nagios attempts the check, no traffic is detected, but when I run the check manually from a normal shell, tcpdump on both machines sees the traffic and the check works perfectly.

This behavior is new; until today the Nagios instance was working perfectly. I deleted the logs and retention files, and rebooted the Nagios container (LXC) and the remote container too. Some of the containers are on another machine and some are on the same physical machine.

Please, any ideas? :?

Thanks in advance
Nomar
Last edited by KalimAlRazif on Wed Sep 26, 2018 3:46 pm, edited 1 time in total.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Socket timeout madness

Post by cdienger »

Are the configured checks configured to use the IP address or hostname of the destination? How exactly are you running the plugins and tcpdump on the command line? If the tcpdump is capturing only a specific port or IP address and the configured checks fail due to a DNS issue, this could produce the behavior you're seeing.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
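
One way to test the DNS hypothesis raised above is to time name resolution from the Nagios server itself. The following is a minimal Python sketch, not something posted in the thread: `time_resolution` is a hypothetical helper name, and the default port 5666 is only assumed because it is the usual NRPE port.

```python
import socket
import time


def time_resolution(name, port=5666):
    """Resolve `name` and report how long the lookup took.

    Returns (elapsed_seconds, sorted list of unique addresses).
    Raises socket.gaierror if the name cannot be resolved.
    """
    start = time.monotonic()
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    elapsed = time.monotonic() - start
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses
```

Calling `time_resolution("some-host-name")` with the name used in the host definition (a placeholder here) and comparing the elapsed time with the check timeout would show whether a slow or failing lookup is plausible. An IP literal resolves without any DNS query, so it makes a useful baseline.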
KalimAlRazif
Posts: 7
Joined: Sat Sep 15, 2018 4:09 pm

Re: Socket timeout madness

Post by KalimAlRazif »

cdienger wrote:Are the configured checks configured to use the IP address or hostname of the destination? How exactly are you running the plugins and tcpdump on the command line? If the tcpdump is capturing only a specific port or IP address and the configured checks fail due to a DNS issue, this could produce the behavior you're seeing.
Hi,
Indeed, all of the affected hosts were defined by name. I changed them so they are now defined by IP address, but it made no difference :cry:
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Socket timeout madness

Post by cdienger »

And the nagios service was restarted after making these changes, correct?

What options are you running tcpdump with? I would update it to capture port 53 traffic and also the IP addresses that the host names were pointing at.

Can you share the config files that include the check and command config ?
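
The suggestion above amounts to widening the capture filter so that DNS traffic and every relevant IP address are included. A rough Python sketch of building such a filter expression follows; `build_capture_filter` is a hypothetical helper, and the addresses in the usage note are documentation placeholders rather than values from this thread.

```python
import ipaddress


def build_capture_filter(addresses, include_dns=True):
    """Build a pcap filter matching traffic to or from any address, plus DNS."""
    terms = []
    for addr in addresses:
        # Raises ValueError for anything that is not an IP literal,
        # so a host name cannot slip into the filter by mistake.
        ipaddress.ip_address(addr)
        terms.append(f"host {addr}")
    if include_dns:
        terms.append("port 53")
    return " or ".join(terms)
```

For example, `build_capture_filter(["192.0.2.10", "192.0.2.11"])` yields `host 192.0.2.10 or host 192.0.2.11 or port 53`, which could be passed in quotes to `tcpdump -n -vv`. Unlike a `dst host` or `src host` filter, `host` matches both directions.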
KalimAlRazif
Posts: 7
Joined: Sat Sep 15, 2018 4:09 pm

Re: Socket timeout madness

Post by KalimAlRazif »

cdienger wrote:And the nagios service was restarted after making these changes, correct?

What options are you running tcpdump with? I would update it to capture port 53 traffic and also the IP addresses that the host names were pointing at.

Can you share the config files that include the check and command config ?
Yes, the service was restarted.

on nagios host:

tcpdump -n dst host remote_host_ip -vv
on remote host:

tcpdump -n src host nagios_host_ip -vv
These two commands are producing errors:

define command {
                command_name                          check_nrpe
                command_line                          $USER1$/check_nrpe -H '$HOSTADDRESS$' -c '$ARG1$' -t 30:3
}
define command {
                command_name                          check_ssh
                command_line                          $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
}
ssh check config

define service {
                service_description                   ssh
                check_command                         check_ssh!
                check_period                          24x7
                notification_period                   24x7
                host_name                             list of hosts separated by comma
                servicegroups                         ssh
                contact_groups                        +admins,jefecg
                use                                   generic-service
}
example nrpe check

define service {
                service_description                   load
                check_command                         check_nrpe!check_load
                check_period                          24x7
                notification_period                   24x7
                host_name                             list of hosts separated by comma
                contact_groups                        +admins,jefecg
                use                                   generic-service
}
But let me make some changes to the host names: the address fields are IP addresses, but host_name is still the FQDN of each host.
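
To see what the definitions above actually execute, it can help to expand the macros by hand. The sketch below is a simplified Python approximation, not Nagios's own logic: it handles only `$USER1$`, `$HOSTADDRESS$` and `$ARGn$`, and it splits the result with `shlex`. The `$USER1$` path and the IP address in the usage note are assumptions, since the thread does not show `resource.cfg` or the real addresses.

```python
import shlex


def expand_command(command_line, check_command, host_address, user1):
    """Expand $USER1$, $HOSTADDRESS$ and $ARGn$ in a command_line.

    Simplified sketch: real Nagios supports many more macros.
    """
    # check_command has the form "name!arg1!arg2..."; only the args are needed here.
    _name, *args = check_command.split("!")
    expanded = command_line.replace("$USER1$", user1)
    expanded = expanded.replace("$HOSTADDRESS$", host_address)
    # Missing positional arguments expand to empty strings.
    for i in range(1, 33):
        value = args[i - 1] if i <= len(args) else ""
        expanded = expanded.replace(f"$ARG{i}$", value)
    return shlex.split(expanded)
```

With `user1="/usr/local/nagios/libexec"` (a common default, assumed here) and a placeholder address, `check_ssh!` expands to just the plugin path followed by the address, because `$ARG1$` is empty. Running the expanded command as the same user the Nagios daemon runs as is then a closer match to what the scheduler does than running it as root.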
KalimAlRazif
Posts: 7
Joined: Sat Sep 15, 2018 4:09 pm

Re: Socket timeout madness

Post by KalimAlRazif »

No way :-( Using only IP addresses in the configs did not work :-(
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Socket timeout madness

Post by npolovenko »

@KalimAlRazif, can you run the nmap command with the host's IP address from the Nagios server and show me the output?
KalimAlRazif
Posts: 7
Joined: Sat Sep 15, 2018 4:09 pm

Re: Socket timeout madness

Post by KalimAlRazif »

the output of the command, executed from nagios host

root@nagios:~# nmap -P0 ip_of_one_of_the_failed_hosts

Starting Nmap 6.00 ( http://nmap.org ) at 2018-09-24 14:04 EDT
Nmap scan report for ipXXX.ip-XX-XX-XX.net (ip_of_one_of_the_failed_hosts)
Host is up (0.090s latency).
Not shown: 996 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
5666/tcp open  nrpe
9101/tcp open  jetdirect
9103/tcp open  jetdirect

Nmap done: 1 IP address (1 host up) scanned in 1.57 seconds

root@nagios-new:~# nmap -P0 another_host_with_error

Starting Nmap 6.00 ( http://nmap.org ) at 2018-09-24 14:07 EDT
Nmap scan report for another_host_with_error (another_host_with_error)
Host is up (0.0094s latency).
Not shown: 996 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
25/tcp   open  smtp
5666/tcp open  nrpe
9102/tcp open  jetdirect

Nmap done: 1 IP address (1 host up) scanned in 0.31 seconds
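
The scans above show ports 22 and 5666 open from the Nagios host. A complementary test is a plain TCP connect with an explicit timeout, which distinguishes a refused connection from one that silently times out. This is a minimal Python sketch; `probe_tcp` is a hypothetical helper name and the 10-second default is an arbitrary choice, not a Nagios setting.

```python
import socket


def probe_tcp(host, port, timeout=10.0):
    """Attempt a TCP connection and classify the result.

    Returns "open", "refused", "timeout", or "error: <reason>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Covers unreachable networks, resolution failures, and similar errors.
        return f"error: {exc}"
```

Running this under the same account as the Nagios daemon, against the affected address on ports 22 and 5666, would show whether the timeout is specific to that execution context.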
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Socket timeout madness

Post by npolovenko »

@KalimAlRazif, you said:
"if nagios attempts the check no traffic is detected but if I perform the check via normal shell the TCP dump on both machines detect traffic and the check works perfectly."
Can you actually run the command from the command line and show me the output? And then take a screenshot of the command failing in the web interface?
KalimAlRazif
Posts: 7
Joined: Sat Sep 15, 2018 4:09 pm

Re: Socket timeout madness

Post by KalimAlRazif »

Sure:


For example ssh is "failing"

./check_ssh -H remote_ip
SSH OK - OpenSSH_7.2p2 Ubuntu-4ubuntu2.2 (protocol 2.0) | time=0.186555s;;;0.000000;10.000000
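
For comparing a manual run with what the scheduler reports, the sketch below runs a plugin with a hard timeout and maps its exit code to the standard plugin states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is an illustrative Python wrapper, not how Nagios itself executes checks; `run_plugin` is a hypothetical name, and treating a timeout as UNKNOWN is an assumption made for this example.

```python
import subprocess

STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv, timeout=30.0):
    """Run a plugin and return (state, text, perfdata) from its first output line."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "UNKNOWN", f"plugin timed out after {timeout:g}s", ""
    lines = result.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Plugin output separates human-readable text from perfdata with "|".
    text, _, perfdata = first_line.partition("|")
    state = STATES.get(result.returncode, "UNKNOWN")
    return state, text.strip(), perfdata.strip()
```

Given output like the `SSH OK` line above, the perfdata part would be the `time=...` field after the pipe.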
Attachments
Screenshot from 2018-09-26 16-25-47.png