Hi,
I have been using Nagios Core to monitor our servers for the past 14 months, but since yesterday I keep getting "host check timed out after 30s" for many servers. I receive a host DOWN alert and then a host UP alert within a minute. When I logged in to the affected server and checked, it had never been down. Below is the error message on the Nagios server:
nagios: job 3470 (pid=9458): read() returned error 11
nagios: wproc: Core Worker 23639: job 3470 (pid=9458) timed out. Killing it
nagios: wproc: CHECK job 3470 from worker Core Worker 23639 timed out after 30.01s
nagios: wproc: host=91d-prod-kfweb-n1; service=(null);
nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
nagios: Warning: Check of host '91d-prod-kfweb-n1' timed out after 30.01 seconds
nagios: HOST ALERT: 91d-prod-kfweb-n1;DOWN;HARD;2;(Host check timed out after 30.01 seconds)
nagios: HOST NOTIFICATION: nagiosadmin;91d-prod-kfweb-n1;DOWN;notify-host-by-email;(Host check timed out after 30.01 seconds)
Could someone kindly help me solve this?
Thanks.
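Since the timeouts affect many servers, it may help to see which hosts are hit and how often. The following is a minimal sketch that counts timed-out host checks per host from log lines in the format shown above. The function name `timed_out_hosts` and the log path in the usage comment are assumptions, not part of the original post.

```shell
# Count "timed out" host check warnings per host from log lines on stdin.
# Assumes the line format shown in the post:
#   nagios: Warning: Check of host '<name>' timed out after <n> seconds
timed_out_hosts() {
    sed -n "s/.*Warning: Check of host '\([^']*\)' timed out after.*/\1/p" |
        sort | uniq -c | sort -rn
}

# Usage (the log path is an assumption; use wherever your syslog writes):
#   timed_out_hosts < /var/log/messages
```

If one host or one network segment dominates the list, that points at the path to those hosts rather than at the Nagios server itself.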
Issue with Nagios core : False host time out alerts
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Issue with Nagios core : False host time out alerts
Are these host checks just doing ping checks?
If so, can you ping the IP for 91d-prod-kfweb-n1 when this happens from the Nagios server?
Can you share one of the host definitions?
Re: Issue with Nagios core : False host time out alerts
Hi Scott,
It had been working fine for the past year. And yes, I am able to ping the host while the alert is active.
I have attached the host definition as requested. In addition, I found the error below on the monitored server:
nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.
But no recent change was made on the Nagios server that would explain the above error.
Kindly suggest.
Thanks.
- Attachments
- Host_definition.txt (2.06 KiB): host definition
Re: Issue with Nagios core : False host time out alerts
Can you try the following the next time this happens (replacing x.x.x.x with the IP of the server)?
/usr/local/nagios/libexec/check_icmp -H x.x.x.x -w 3000.0,80% -c 5000.0,100% -p 5
This should be what Nagios is attempting to execute.
Also, can we confirm there are not multiple Nagios parent processes?
ps -ef | grep nagios.cfg
itssameer wrote:Moreover I can find the below error in the monitored server :
nrpe[6141]: Error: (!log_opts) Could not complete SSL handshake with x.x.x.x: 5.
Could not read request from client x.x.x.x, bailing out...
nrpe[31526]: INFO: SSL Socket Shutdown.
This makes me wonder even more if something is happening on the network that is causing traffic to not route correctly.
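When running the check by hand, it is useful to record both the plugin's exit code and how long it took, so the result can be compared with the 30-second timeout in the log. A minimal sketch follows; the helper name `run_timed` is hypothetical, and elapsed time is only measured to whole seconds.

```shell
# Run any command, then report its exit status and elapsed wall-clock time.
# Nagios plugins exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).
run_timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "exit=$status elapsed=$((end - start))s"
    return "$status"
}

# Usage, with the command from the post (replace x.x.x.x with the host IP):
#   run_timed /usr/local/nagios/libexec/check_icmp -H x.x.x.x -w 3000.0,80% -c 5000.0,100% -p 5
```

A manual run that returns quickly with exit 0 while Nagios is reporting a timeout would suggest the problem lies in how the check is scheduled or executed, not in reachability.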
Re: Issue with Nagios core : False host time out alerts
Sure, next time I'll run the check_icmp command. Below is the output of ps -ef | grep nagios.cfg.
So there are 2 parent processes.
nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
What can I do?
Re: Issue with Nagios core : False host time out alerts
itssameer wrote:So there are 2 parent process .
nagios 18448 1 0 Oct19 ? 00:01:50 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 18454 18448 0 Oct19 ? 00:00:14 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
What can I do ?
That's actually just one parent and one child, so that is normal: the second process (PID 18454) lists 18448, the PID of the first, as its parent. At this point we need to see what the server sees by running the command while it is reporting the timeouts.
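The parent/child reading of the `ps -ef` output can also be checked mechanically. The sketch below counts only those Nagios processes whose parent PID is not itself one of the listed Nagios processes; a result of 1 means a single parent. The function name `count_nagios_parents` is hypothetical, and the sketch assumes the standard `ps -ef` column order (UID, PID, PPID, ...).

```shell
# Read `ps -ef` output on stdin and count Nagios daemon processes that are
# not children of another listed Nagios process.
# Column 2 is the PID and column 3 is the PPID.
count_nagios_parents() {
    awk '/nagios\.cfg/ && !/grep/ { pid[$2] = 1; ppid[$2] = $3 }
         END {
             n = 0
             for (p in ppid) if (!(ppid[p] in pid)) n++
             print n
         }'
}

# Usage:
#   ps -ef | count_nagios_parents
```

A count above 1 would indicate more than one independent Nagios daemon, which is the situation the question was meant to rule out.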