check_nrpe works from CLI, fails from server with timeout
Posted: Tue Jun 18, 2013 11:06 am
Today I saw that one of my services was giving an error with the message: "CHECK_NRPE: Socket timeout after 10 seconds."
I figured the service was down so I started checking. The service was up, so Nagios was making a mistake. So I went to the command line on the Nagios server (the one making the check_nrpe call, not the server being probed) and did this:
$ time /usr/lib/nagios/plugins/check_nrpe -H my.hostname -c check_my_nrpe_service
PING OK - Packet loss = 0%, RTA = 88.35 ms|rta=88.345001ms;100.000000;1000.000000;0.000000 pl=0%;10;10;0
real 0m4.129s
user 0m0.008s
sys 0m0.000s
(The "service" to be checked is basically running check_ping).
So, run by hand from the Nagios server, the check comes back in about 4 seconds, yet the same check run by Nagios fails with a 10-second socket timeout.
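A single manual run does not show whether the check occasionally takes longer than 10 seconds. Below is a rough sketch for timing repeated runs, assuming a shell where `date +%s` is available. The helper name `time_runs` and the run count of 20 are placeholders of mine, not part of Nagios or NRPE.

```shell
# time_runs N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints the exit status and
# whole-second elapsed time of each run. Output of CMD itself is discarded.
time_runs() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1
        status=$?
        end=$(date +%s)
        echo "run=$i status=$status elapsed=$((end - start))s"
        i=$((i + 1))
    done
}

# Example with the same command as above (20 runs is an arbitrary choice):
# time_runs 20 /usr/lib/nagios/plugins/check_nrpe -H my.hostname -c check_my_nrpe_service
```

Any run with a non-zero status or an elapsed time near 10 seconds would suggest the timeout is intermittent rather than specific to how Nagios invokes the check.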
I have other services on this same server being checked via NRPE (e.g. system load, user load, disk space, etc.) and they all seem to work without a problem.
I searched around and the only promising lead was a badly-cached IP address lookup (which *has* happened to me when configuring iptables and a host's IP address changes), but I double-checked the hostname in the monitor's config file (it's correct), DNS resolves correctly, and I have restarted Nagios entirely just in case there was an incorrect cached DNS lookup. No change in behavior.
Any suggestions?