Issue with Nagios core : False host time out alerts

Posted: Sat Oct 19, 2019 3:18 am
by itssameer
Hi,

I have been using Nagios Core to monitor our servers for the past 14 months. Since yesterday, however, I have been continuously getting "host check timed out after 30s" for many servers: I receive a host down alert, followed by a host up alert within a minute. When I logged in to the servers and checked, they had never been down. Below is the error message on the Nagios server:

nagios: job 3470 (pid=9458): read() returned error 11
nagios: wproc: Core Worker 23639: job 3470 (pid=9458) timed out. Killing it
nagios: wproc: CHECK job 3470 from worker Core Worker 23639 timed out after 30.01s
nagios: wproc: host=91d-prod-kfweb-n1; service=(null);
nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
nagios: Warning: Check of host '91d-prod-kfweb-n1' timed out after 30.01 seconds
nagios: HOST ALERT: 91d-prod-kfweb-n1;DOWN;HARD;2;(Host check timed out after 30.01 seconds)
nagios: HOST NOTIFICATION: nagiosadmin;91d-prod-kfweb-n1;DOWN;notify-host-by-email;(Host check timed out after 30.01 seconds)

Could someone kindly help me solve this?

Thanks.

Re: Issue with Nagios core : False host time out alerts

Posted: Mon Oct 21, 2019 9:28 am
by scottwilkerson
Are these host checks just doing ping checks?

If so, can you ping the IP for 91d-prod-kfweb-n1 when this happens from the Nagios server?

Can you share one of the host definitions?

Re: Issue with Nagios core : False host time out alerts

Posted: Mon Oct 21, 2019 12:46 pm
by itssameer
Hi Scott,

It was working fine for the past year. And yes, I am able to ping the host while the alert is active.

I have attached the host definition as requested. I also found the error below on the monitored server:

nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.

However, no recent changes were made on the Nagios server that would explain the above error.

Kindly suggest.


Thanks.

Re: Issue with Nagios core : False host time out alerts

Posted: Mon Oct 21, 2019 12:56 pm
by scottwilkerson
Can you attempt the following the next time this happens (replacing x.x.x.x with the IP of the server)?

/usr/local/nagios/libexec/check_icmp -H x.x.x.x -w 3000.0,80% -c 5000.0,100% -p 5

This should be what Nagios is attempting to execute.
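
A minimal sketch of how the manual test above could be recorded, so its result can be compared with the Nagios log afterwards. The helper name run_check and the log path are hypothetical and not part of Nagios; the check_icmp path and thresholds are the ones quoted in this thread. It assumes a POSIX shell and a date command that supports +%s.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): run a plugin command and print one
# line with a timestamp, elapsed seconds, exit status, and the plugin output.
# Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
run_check() {
    start=$(date +%s)
    # Capture stdout and stderr; keep the plugin's exit status either way.
    output=$("$@" 2>&1) && status=0 || status=$?
    end=$(date +%s)
    printf '%s elapsed=%ss status=%s output=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$((end - start))" "$status" "$output"
    return "$status"
}

# Example invocation (x.x.x.x is a placeholder, the log path is an assumption):
# run_check /usr/local/nagios/libexec/check_icmp -H x.x.x.x \
#     -w 3000.0,80% -c 5000.0,100% -p 5 >> /tmp/check_icmp_debug.log
```

Running it as the nagios user while an alert is active would show whether the plugin itself is slow or failing, or whether only the check scheduled by Nagios times out.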

Also, can we confirm there are not multiple nagios parent processes?

ps -ef|grep nagios.cfg
itssameer wrote:Moreover I can find the below error in the monitored server :

nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.
This makes me wonder even more whether something is happening on the network that is causing traffic to not route correctly.

Re: Issue with Nagios core : False host time out alerts

Posted: Mon Oct 21, 2019 1:05 pm
by itssameer
Sure, next time I'll run the check_icmp command. Below is the output of ps -ef | grep nagios.cfg

So there are 2 parent processes.

nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


What can I do?

Re: Issue with Nagios core : False host time out alerts

Posted: Mon Oct 21, 2019 2:55 pm
by scottwilkerson
itssameer wrote:So there are 2 parent process .

nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


What can I do ?
That's actually just 1 parent and 1 child (the second process has the first one's PID, 18448, as its parent PID), so that is normal. At this point we need to see what the server sees by running the command when it is reporting the timeouts.
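
As a rough sketch, the parent/child reading of the ps output above can be automated. The function name count_top_level is hypothetical; it assumes ps -ef style columns (UID, PID, PPID, ...) and counts only the processes whose parent is not itself in the list.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` style lines on stdin and print how many
# listed processes have a parent that is NOT in the list. Each such process is
# a separate top-level daemon; a result above 1 would suggest duplicate parents.
count_top_level() {
    awk '{ pid[$2] = 1; ppid[$2] = $3 }
         END {
             n = 0
             for (p in ppid) if (!(ppid[p] in pid)) n++
             print n
         }'
}

# Example invocation; the [n] in the pattern keeps grep from matching itself:
# ps -ef | grep '[n]agios.cfg' | count_top_level
```

For the two lines posted above, PID 18454 has PPID 18448, which is in the list, so only 18448 counts and the function prints 1.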