Issue with Nagios core : False host time out alerts

itssameer
Posts: 5
Joined: Sat Oct 19, 2019 3:05 am

Issue with Nagios core : False host time out alerts

Post by itssameer »

Hi,

I have been using Nagios Core to monitor our servers for the past 14 months, but since yesterday I have been getting "host check timed out after 30s" for many servers. Each time I receive a host DOWN alert, followed by a host UP alert within a minute. When I logged in to the affected server and checked, it had never been down. Below are the error messages on the Nagios server:

nagios: job 3470 (pid=9458): read() returned error 11
nagios: wproc: Core Worker 23639: job 3470 (pid=9458) timed out. Killing it
nagios: wproc: CHECK job 3470 from worker Core Worker 23639 timed out after 30.01s
nagios: wproc: host=91d-prod-kfweb-n1; service=(null);
nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
nagios: Warning: Check of host '91d-prod-kfweb-n1' timed out after 30.01 seconds
nagios: HOST ALERT: 91d-prod-kfweb-n1;DOWN;HARD;2;(Host check timed out after 30.01 seconds)
nagios: HOST NOTIFICATION: nagiosadmin;91d-prod-kfweb-n1;DOWN;notify-host-by-email;(Host check timed out after 30.01 seconds)

Could someone please advise on how to solve this?

Thanks.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Issue with Nagios core : False host time out alerts

Post by scottwilkerson »

Are these host checks just doing ping checks?

If so, can you ping the IP for 91d-prod-kfweb-n1 when this happens from the Nagios server?

Can you share one of the host definitions?
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com

Re: Issue with Nagios core : False host time out alerts

Post by itssameer »

Hi Scott,

It had been working fine for the past year. And yes, I am able to ping the host while the alert is active.

I have attached the host definition as requested. I also found the error below on the monitored server:

nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.

But no changes have been made on the Nagios server recently that would explain the above error.

Kindly suggest.


Thanks.
Attachments
Host_definition.txt
Host definition (2.06 KiB)

Re: Issue with Nagios core : False host time out alerts

Post by scottwilkerson »

Can you run the following the next time this happens (replacing x.x.x.x with the IP of the server)?

/usr/local/nagios/libexec/check_icmp -H x.x.x.x -w 3000.0,80% -c 5000.0,100% -p 5

This should be the command Nagios is attempting to execute.

Also, can we confirm there are not multiple nagios parent processes?
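Since the timeouts are intermittent, it may help to run that command on a schedule and keep a timestamped record rather than trying to catch it by hand. Below is a minimal sketch, not an official Nagios tool: `log_check` is a hypothetical helper name, the log path is an assumption, and it relies on `date +%s` being available (true for GNU and BSD date). Standard plugin exit codes are 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.

```shell
#!/bin/sh
# Sketch: run a check command once and print one log line with a timestamp,
# the exit status, the elapsed wall-clock seconds, and the plugin output.
# log_check is a hypothetical helper name, not part of Nagios.
log_check() {
    start=$(date +%s)
    # Capture output and exit status without tripping `set -e`.
    output=$("$@" 2>&1) && status=0 || status=$?
    end=$(date +%s)
    printf '%s exit=%s elapsed=%ss %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$status" "$((end - start))" "$output"
    return "$status"
}

# Example (replace x.x.x.x with the IP of the server; log path is an assumption):
#   while true; do
#       log_check /usr/local/nagios/libexec/check_icmp -H x.x.x.x \
#           -w 3000.0,80% -c 5000.0,100% -p 5 >> /tmp/check_icmp.log
#       sleep 60
#   done
```

Lines in the log where `elapsed` approaches 30s, or where `exit` is non-zero, can then be compared against the times of the host DOWN alerts.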

ps -ef|grep nagios.cfg

itssameer wrote:I also found the error below on the monitored server:

nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.

This makes me wonder even more whether something is happening on the network that is causing traffic to not route correctly.

Re: Issue with Nagios core : False host time out alerts

Post by itssameer »

Sure, next time it happens I'll run the check_icmp command. Below is the output of ps -ef | grep nagios.cfg.

It looks like there are 2 parent processes:

nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


What can I do?

Re: Issue with Nagios core : False host time out alerts

Post by scottwilkerson »

itssameer wrote:It looks like there are 2 parent processes:

nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

What can I do?

That's actually just one parent and one child (the third column of the second line is the PID of the first), so that is normal. At this point we need to see what the server sees by running the check_icmp command while it is reporting the timeouts.
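For anyone wanting to check this without reading PPIDs by eye, here is a rough sketch that counts only the Nagios processes whose parent is not itself a Nagios process. `count_nagios_parents` is a hypothetical helper name, and it assumes `ps -ef` style columns (UID, PID, PPID, ...) as in the output above.

```shell
#!/bin/sh
# Sketch: count Nagios daemon processes whose parent is NOT another Nagios
# daemon. Reads `ps -ef` style lines on stdin (column 2 = PID, column 3 = PPID).
# count_nagios_parents is a hypothetical helper name, not part of Nagios.
count_nagios_parents() {
    awk '
        # Remember every process started with nagios.cfg, skipping the grep itself.
        /nagios\.cfg/ && !/grep/ { pid[$2] = 1; ppid[$2] = $3 }
        END {
            n = 0
            # A "parent" is a process whose own PPID is not in the Nagios set.
            for (p in pid) if (!(ppid[p] in pid)) n++
            print n
        }
    '
}

# Usage on a live system:
#   ps -ef | count_nagios_parents
```

A result of 1 matches the normal case described above; a result greater than 1 would suggest more than one independent Nagios daemon is running.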