Issue with Nagios core : False host time out alerts

Post by itssameer » Sat Oct 19, 2019 3:18 am

Hi,

I have been using Nagios Core to monitor our servers for the past 14 months, but since yesterday I have been continuously getting "host check timed out after 30s" for many servers. I receive a host DOWN alert, followed by a host UP alert within a minute. When I logged in to the server and checked, it had never been down. Below are the messages logged on the Nagios server:

nagios: job 3470 (pid=9458): read() returned error 11
nagios: wproc: Core Worker 23639: job 3470 (pid=9458) timed out. Killing it
nagios: wproc: CHECK job 3470 from worker Core Worker 23639 timed out after 30.01s
nagios: wproc: host=91d-prod-kfweb-n1; service=(null);
nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
nagios: Warning: Check of host '91d-prod-kfweb-n1' timed out after 30.01 seconds
nagios: HOST ALERT: 91d-prod-kfweb-n1;DOWN;HARD;2;(Host check timed out after 30.01 seconds)
nagios: HOST NOTIFICATION: nagiosadmin;91d-prod-kfweb-n1;DOWN;notify-host-by-email;(Host check timed out after 30.01 seconds)
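
To see whether the timeouts are concentrated on a few hosts or spread across many, log lines like the ones above can be tallied per host. This is a minimal sketch, assuming bash and awk; `count_host_timeouts` is a hypothetical helper name, and the log path in the usage comment is an assumption based on the /usr/local/nagios install prefix seen elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Read Nagios log lines on stdin and print "<count> <host>" for every host
# that has "Check of host '<name>' timed out" warnings, busiest host first.
count_host_timeouts() {
    # Splitting on single quotes makes field 2 the host name inside '...'.
    awk -F"'" '
        /Warning: Check of host .* timed out/ { count[$2]++ }
        END { for (host in count) print count[host], host }
    ' | sort -rn
}

# Usage (the path is an assumption; adjust it to wherever your logs go):
#   count_host_timeouts < /usr/local/nagios/var/nagios.log
```

If one or two hosts dominate the list, the problem is more likely on those hosts or their network path; an even spread points toward the Nagios server itself or a shared network segment.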

Could someone kindly help me solve this?

Thanks.
itssameer
 
Posts: 3
Joined: Sat Oct 19, 2019 3:05 am

Re: Issue with Nagios core : False host time out alerts

Post by scottwilkerson » Mon Oct 21, 2019 9:28 am

Are these host checks just doing ping checks?

If so, can you ping the IP for 91d-prod-kfweb-n1 when this happens from the Nagios server?

Can you share one of the host definitions?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
scottwilkerson
DevOps Engineer
 
Posts: 16724
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Issue with Nagios core : False host time out alerts

Post by itssameer » Mon Oct 21, 2019 12:46 pm

Hi Scott,

It had been working fine for the past year. And yes, I am able to ping the host while the alert is active.

I have attached the host definition as requested. I also found the error below on the monitored server:

nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.

However, no recent change was made on the Nagios server that would explain the above error.

Kindly suggest.


Thanks.
Attachments
Host_definition.txt (host definition, 2.06 KiB)

Re: Issue with Nagios core : False host time out alerts

Post by scottwilkerson » Mon Oct 21, 2019 12:56 pm

Can you attempt the following the next time this happens (replacing x.x.x.x with the IP of the server)?

/usr/local/nagios/libexec/check_icmp -H x.x.x.x -w 3000.0,80% -c 5000.0,100% -p 5

This should be what Nagios is attempting to execute.
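
When running the plugin by hand during an alert, it helps to record when it ran, how long it took, and which state it returned, so the result can be lined up against the Nagios log. The following is a minimal sketch, assuming bash; `run_check` and `plugin_state` are hypothetical helper names, not part of Nagios. The exit-code mapping follows the standard plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/usr/bin/env bash
# Map a plugin exit code to its conventional Nagios state name.
plugin_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo OTHER ;;   # anything outside the 0-3 range
    esac
}

# Run the given command and print one timestamped summary line with the
# state, the exit code, the elapsed whole seconds, and the captured output.
run_check() {
    local start end output rc
    start=$(date +%s)
    output=$("$@" 2>&1)
    rc=$?
    end=$(date +%s)
    echo "$(date '+%F %T') state=$(plugin_state "$rc") rc=$rc secs=$((end - start)) output=$output"
}

# Usage, with the command from this post (replace x.x.x.x with the host IP):
#   run_check /usr/local/nagios/libexec/check_icmp -H x.x.x.x -w 3000.0,80% -c 5000.0,100% -p 5
```

A `secs` value close to 30 would match the timeout Nagios reports, while a fast OK result during an alert would suggest the delay happens somewhere other than the plugin itself.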

Also, can we confirm there are not multiple Nagios parent processes?
ps -ef|grep nagios.cfg


itssameer wrote:Moreover I can find the below error in the monitored server :

nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.


This makes me wonder even more whether something is happening on the network that is preventing traffic from routing correctly.

Re: Issue with Nagios core : False host time out alerts

Post by itssameer » Mon Oct 21, 2019 1:05 pm

Sure, next time I'll run the check_icmp command. Below is the output of ps -ef | grep nagios.cfg.

So there are 2 parent processes.

nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


What can I do?

Re: Issue with Nagios core : False host time out alerts

Post by scottwilkerson » Mon Oct 21, 2019 2:55 pm

itssameer wrote:So there are 2 parent process .

nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


What can I do ?


That's actually just one parent and one child, which is normal: the second process (PID 18454) lists the first (PID 18448) as its parent. At this point we need to see what the server sees by running the command while it is reporting the timeouts.
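
The parent/child reading of the ps output can be checked mechanically by comparing the PID and PPID columns. This is a minimal sketch, assuming bash and awk, and input in `ps -ef` layout (UID, PID, PPID, ...); `count_top_level` is a hypothetical helper name. It counts only the processes whose parent is not itself one of the listed processes.

```shell
#!/usr/bin/env bash
# Read "ps -ef"-style lines on stdin (column 2 = PID, column 3 = PPID) and
# print how many of them are top-level, i.e. their PPID is not the PID of
# another line in the same input.
count_top_level() {
    awk '
        { seen[$2] = 1; parent[$2] = $3 }
        END {
            n = 0
            for (p in parent)
                if (!(parent[p] in seen)) n++
            print n
        }'
}

# Usage: the [n] in the pattern keeps grep from matching its own process,
# which would otherwise be counted as an extra top-level entry.
#   ps -ef | grep '[n]agios.cfg' | count_top_level
```

For the two lines quoted above, 18454 has PPID 18448, so only 18448 counts as top-level. A result greater than 1 would suggest more than one Nagios daemon is running.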

