Issue with Nagios core : False host time out alerts

itssameer · Post by **itssameer** » Sat Oct 19, 2019 3:18 am

Hi,

I have been using Nagios Core to monitor our servers for the past 14 months. Since yesterday, host checks for many servers have been timing out after 30 seconds: I receive a host DOWN alert, followed by a host UP alert within a minute. When I logged in to the affected servers, they had never been down. Below are the error messages on the Nagios server:

nagios: job 3470 (pid=9458): read() returned error 11
nagios: wproc: Core Worker 23639: job 3470 (pid=9458) timed out. Killing it
nagios: wproc: CHECK job 3470 from worker Core Worker 23639 timed out after 30.01s
nagios: wproc: host=91d-prod-kfweb-n1; service=(null);
nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
nagios: Warning: Check of host '91d-prod-kfweb-n1' timed out after 30.01 seconds
nagios: HOST ALERT: 91d-prod-kfweb-n1;DOWN;HARD;2;(Host check timed out after 30.01 seconds)
nagios: HOST NOTIFICATION: nagiosadmin;91d-prod-kfweb-n1;DOWN;notify-host-by-email;(Host check timed out after 30.01 seconds)
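To see which hosts are affected and how often, the "Check of host ... timed out" warnings can be tallied per host. This is only a minimal sketch: the function name `count_host_timeouts` is hypothetical, and it assumes the warning lines keep the wording shown above, with the host name in single quotes.

```shell
# Count "Check of host '<name>' timed out" warnings per host.
# Reads log lines on stdin; prints "<count> <host>", highest count first.
count_host_timeouts() {
    awk -F"'" '/Warning: Check of host .* timed out/ { n[$2]++ }
               END { for (h in n) print n[h], h }' | sort -rn
}

# Usage (the log path is an assumption; use wherever these messages are written):
# count_host_timeouts < /var/log/messages
```

If only a few hosts dominate the list, the cause is more likely on the path to those hosts than on the Nagios server itself.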

Could someone kindly help me solve this?

Thanks.

scottwilkerson · Post by **scottwilkerson** » Mon Oct 21, 2019 9:28 am

Are these host checks just doing ping checks?

If so, can you ping the IP for 91d-prod-kfweb-n1 when this happens from the Nagios server?

Can you share one of the host definitions?

itssameer · Post by **itssameer** » Mon Oct 21, 2019 12:46 pm

Hi Scott,

It worked fine for the past year. And yes, I am able to ping the host while the alert is active.

I have attached the host definition as requested. I also see the following error on the monitored server:

nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.

However, no recent change was made on the Nagios server that would explain the above error.

Kindly suggest.

Thanks.

scottwilkerson · Post by **scottwilkerson** » Mon Oct 21, 2019 12:56 pm

Can you try the following the next time this happens (replacing x.x.x.x with the IP of the server)?

/usr/local/nagios/libexec/check_icmp -H x.x.x.x -w 3000.0,80% -c 5000.0,100% -p 5

This should be the command Nagios is attempting to execute.
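When running the check by hand, it helps to record how long it took and which exit code it returned (Nagios plugins exit with 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). A small sketch of a wrapper for that; the name `timed_check` is hypothetical and the timing is only accurate to whole seconds:

```shell
# Run any check command, then report its exit code and elapsed seconds.
timed_check() {
    start=$(date +%s)
    "$@"                      # run the command exactly as given
    rc=$?
    end=$(date +%s)
    echo "exit=$rc elapsed=$((end - start))s"
    return "$rc"
}

# Usage (x.x.x.x remains a placeholder for the server IP):
# timed_check /usr/local/nagios/libexec/check_icmp -H x.x.x.x -w 3000.0,80% -c 5000.0,100% -p 5
```

An elapsed time close to 30 seconds during an alert would suggest the plugin itself is hanging, rather than Nagios mis-reporting a fast check.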

Also, can we confirm there are not multiple Nagios parent processes?

ps -ef|grep nagios.cfg

itssameer wrote:Moreover I can find the below error in the monitored server :

nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.

This makes me wonder even more whether something on the network is preventing traffic from routing correctly.

itssameer · Post by **itssameer** » Mon Oct 21, 2019 1:05 pm

Sure, I'll run the check_icmp command the next time it happens. Below is the output of ps -ef | grep nagios.cfg.

So there are 2 parent processes:

nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

What can I do?

scottwilkerson · Post by **scottwilkerson** » Mon Oct 21, 2019 2:55 pm

itssameer wrote:So there are 2 parent processes:

nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

What can I do?

That's actually just one parent and one child, so that is normal. At this point we need to see what the Nagios server sees, by running the check command while it is reporting the timeouts.
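The parent/child relationship can be read from the third column (PPID) of the ps -ef output: PID 18454 lists 18448 as its parent, and only 18448 is parented by PID 1. As a rough sketch, assuming the standard ps -ef column order (UID, PID, PPID, ...), the filter below prints only the nagios processes whose parent is not itself a nagios process. The function name `top_level_nagios` is hypothetical.

```shell
# Print PIDs of nagios daemons that are not children of another nagios process.
# Reads `ps -ef`-style lines on stdin (columns: UID PID PPID ...).
top_level_nagios() {
    awk '/nagios\.cfg/ && !/grep/ { pid[$2] = $3 }
         END { for (p in pid) if (!(pid[p] in pid)) print p }' | sort -n
}

# Usage: ps -ef | top_level_nagios
```

One PID printed means a single Nagios instance; more than one would point to duplicate daemons running against the same configuration.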

Nagios Support Forum
