No answer from host
Posted: Mon Apr 24, 2017 8:45 am
by SETR
For many of our VMs (both Windows and Linux) we keep getting either "No answer from host" or "ERROR: Process name table : No response from remote host". We have tried everything we could from our side, but no luck so far. Any help in resolving this issue is
What version of Nagios XI are you using? 5.4.3
VMware Image or Manual Install of XI? VMware image
Windows version: 2008 R2, 2012, 2003
Linux : RedHat
user account: Local Admin account
Protocol used : SNMP, NCPA
Re: No answer from host
Posted: Mon Apr 24, 2017 11:57 am
by dwhitfield
Could you be a little more specific about what you tried?
What's the output of
ps -ef | grep "bin/nagios"?
If you have multiple processes, please run through the following. Note: you can install killall with yum install psmisc
service nagios stop
killall nagios
service ndo2db stop
service ndo2db start
service nagios start
If that doesn't resolve your issue, can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the Download Profile button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). The profile gives us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.
After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.
Also, how often is this happening? Was it working for a period of time and then changed, and is it now failing consistently? You didn't change servers and restore from a previous backup, did you? Can you verify the IP is right in the service configuration?
UPDATE: profile received and shared with techs
Re: No answer from host
Posted: Wed Apr 26, 2017 10:31 am
by SETR
I have just sent you a PM with our profile.
Re: No answer from host
Posted: Wed Apr 26, 2017 4:30 pm
by dwhitfield
Thank you for the profile.
It's possible you did this first step but didn't give the output because you sent the profile, but I'll ask again: What's the output of
ps -ef | grep "bin/nagios"?
If you have multiple processes, please run through the following. Note: you can install killall with yum install psmisc
service nagios stop
killall nagios
service ndo2db stop
service ndo2db start
service nagios start
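To tell whether more than one Nagios daemon is actually running, it can help to count only the top-level daemons rather than every line that matches bin/nagios. A minimal sketch, assuming a master daemon is a "bin/nagios -d" process whose parent PID is 1 (that may not hold on every init system); the function name count_nagios_masters is made up for this example:

```shell
# Count Nagios master daemons in `ps -ef` style input read from stdin.
# Assumption: a master is a "bin/nagios -d" process whose PPID (column 3) is 1.
# Worker processes and children of the master have the master's PID as PPID
# and are therefore not counted.
count_nagios_masters() {
    awk '$3 == 1 && /bin\/nagios -d/ { n++ } END { print n + 0 }'
}

# Usage on the XI server:
#   ps -ef | count_nagios_masters
```

A result greater than 1 would suggest duplicate daemons, in which case the stop/killall/start sequence above applies.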
You might have missed this after the bit about the profile, but... how often is this happening? Was it working for a period of time and then changed, and is it now failing consistently? You didn't change servers and restore from a previous backup, did you? Can you verify the IP is right in the service configuration?
It kinda sounds like an intermittent issue, which suggests a network/DNS problem.
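One way to check whether failures really are intermittent is to repeat the same lookup or query a number of times and count how many attempts fail. A minimal sketch; the helper name count_failures and the host name in the usage comment are placeholders, not anything taken from this thread:

```shell
# Run a command N times and print how many of the runs failed.
# Usage: count_failures N command [args...]
count_failures() {
    n=$1; shift
    fails=0
    i=0
    while [ "$i" -lt "$n" ]; do
        # Discard the command's output; only its exit status matters here.
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}

# Example (placeholder host name): resolve the same name 20 times
#   count_failures 20 getent hosts somehost.example.com
```

A count that is neither zero nor equal to the number of attempts would point towards an intermittent problem rather than a misconfiguration.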
Re: No answer from host
Posted: Thu Apr 27, 2017 11:02 am
by medleyb
Answers in order. Before the second reset:
LUTL-NAGIOSXI.ams.com>>/root>>ps -ef | grep bin/nagios
nagios 19352 1 0 Apr24 ? 00:32:31 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 19354 19352 0 Apr24 ? 00:02:30 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19355 19352 0 Apr24 ? 00:02:32 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19356 19352 0 Apr24 ? 00:02:30 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19357 19352 0 Apr24 ? 00:02:31 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19358 19352 0 Apr24 ? 00:02:29 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19359 19352 0 Apr24 ? 00:02:29 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 19424 19352 0 Apr24 ? 00:00:18 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 32300 9984 0 11:29 pts/0 00:00:00 grep bin/nagios
Ran the script with killall:
LUTL-NAGIOSXI.ams.com>>/root>>service nagios stop
Stopping nagios:. done.
LUTL-NAGIOSXI.ams.com>>/root>>killall nagios
nagios: no process killed
LUTL-NAGIOSXI.ams.com>>/root>>service ndo2db stop
Stopping ndo2db: done.
LUTL-NAGIOSXI.ams.com>>/root>>service ndo2db start
Starting ndo2db: done.
LUTL-NAGIOSXI.ams.com>>/root>>service nagios start
Starting nagios: done.
ps -ef | grep bin/nagios shows this:
LUTL-NAGIOSXI.ams.com>>/root>>ps -ef | grep bin/nagios
nagios 34588 1 1 11:32 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 34590 34588 0 11:32 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 34591 34588 0 11:32 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 34592 34588 0 11:32 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 34593 34588 0 11:32 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 34594 34588 0 11:32 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 34595 34588 0 11:32 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 34607 34588 0 11:32 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 35103 9984 0 11:33 pts/0 00:00:00 grep bin/nagios
Same output.
The issue is happening daily across several hundred SNMP service checks (and we have seen it on NCPA service checks as well). It worked briefly for a time, and then we added several VMs. We are up to about 40% of the VMs we want to monitor, so we expect this issue to become more pervasive when the other VMs are added. We have been running on the same server since we started the trial; the server was not changed and no backup was restored. And I do not see how it can be a network issue, since these are running in the same LAN rack on a gigabit backend.
Please advise
Bert Medley for Soumya V.
Re: No answer from host
Posted: Thu Apr 27, 2017 3:28 pm
by tgriep
The message that you are describing sounds like the plugins are timing out when they are running.
Most plugins have a short default timeout, and as more hosts and services are checked, things take longer and the plugins can start to time out.
You can increase the timeout of the SNMP commands by going to the Core Config Manager > Commands menu, searching for the commands that have SNMP in the name, and adding the following to the command line.
This will increase the timeout to 60 seconds.
The NCPA plugin uses a capital T, so you would add the following.
Try that out and let us know if this helps with the timeouts.
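The option text itself does not appear in the post as preserved here. Judging from the check_snmp_storage.pl command later in this thread, which passes -t 60, and from the remark about a capital T for NCPA, the additions were presumably along these lines. The command lines below are hypothetical examples, not the actual definitions from this system:

```text
# SNMP-based command (hypothetical definition), timeout option appended:
$USER1$/check_snmp_storage.pl -H $HOSTADDRESS$ $ARG1$ -t 60

# NCPA command (hypothetical definition), note the capital T:
$USER1$/check_ncpa.py -H $HOSTADDRESS$ $ARG1$ -T 60
```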
Re: No answer from host
Posted: Thu May 18, 2017 9:18 am
by SETR
Unfortunately, this solution did not help. We are still facing this issue.
Re: No answer from host
Posted: Thu May 18, 2017 11:53 am
by tgriep
Can you run the following against one of the Windows systems that is randomly failing and post the output? Replace xxx.xxx.xxx.xxx with the IP address; the community string also has to be changed.
time /usr/local/nagios/libexec/check_snmp_storage.pl -H xxx.xxx.xxx.xxx -C communitystring --v2c -m 'Physical Memory' -w 80 -c 90 -f -t 60 --octetlength=65535
Then, can you PM me a new System Profile as well as some of the host and service names that are failing so I can check the settings for them?
Re: No answer from host
Posted: Fri May 19, 2017 11:27 am
by SETR
LUTL-NAGIOSXI.ams.com>>/root>>time /usr/local/nagios/libexec/check_snmp_storage.pl -H xxx.xx.xx.xx -C ARM1 --v2c -m 'Physical Memory' -w 80 -c 90 -f -t 60 --octetlength=65535
ERROR: General time-out (Alarm signal)
real 0m15.176s
user 0m0.163s
sys 0m0.013s
LUTL-NAGIOSXI.ams.com>>/root>>
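Note that the run above ended after about 15 seconds even though -t 60 was passed. A small wrapper can make that kind of mismatch easier to spot across several hosts. This is only a sketch; the function name elapsed_vs_timeout is made up for the example, and it works at whole-second resolution:

```shell
# Run a command and report whether it failed before the timeout it was given.
# Usage: elapsed_vs_timeout TIMEOUT_SECONDS command [args...]
elapsed_vs_timeout() {
    limit=$1; shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$status" -ne 0 ] && [ "$elapsed" -lt "$limit" ]; then
        # The command gave up early, so some other limit is probably in effect.
        echo "failed after ${elapsed}s, before the ${limit}s limit"
    else
        echo "exit $status after ${elapsed}s"
    fi
}

# Example, with the same placeholders as the command above:
#   elapsed_vs_timeout 60 /usr/local/nagios/libexec/check_snmp_storage.pl \
#       -H xxx.xxx.xxx.xxx -C communitystring --v2c -m 'Physical Memory' \
#       -w 80 -c 90 -f -t 60 --octetlength=65535
```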
Re: No answer from host
Posted: Fri May 19, 2017 12:29 pm
by tgriep
I took a look at the plugin and there is a separate timeout setting inside it, which is why it timed out at 15 seconds and not at 60 as set by the command line option.
To increase the other timeout setting, edit the /usr/local/nagios/libexec/check_snmp_storage.pl file and change this line from
to
Save the file and from now on, the internal timeout will be increased to 60 seconds and hopefully the false errors will go away.
FYI, if you upgrade the XI server or the Windows Wizard, the plugin could get overwritten and the changes will have to be done again.
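Since an upgrade can overwrite the edited plugin, it may be worth keeping a dated copy of the modified file so the change can be compared and reapplied afterwards. A minimal sketch; the function name save_plugin_copy and the .local-YYYYMMDD suffix are arbitrary choices for this example:

```shell
# Keep a dated copy of a locally edited plugin.
# Usage: save_plugin_copy /path/to/plugin   (prints the path of the copy)
save_plugin_copy() {
    src=$1
    dest="$src.local-$(date +%Y%m%d)"
    # -p preserves mode, ownership and timestamps where possible.
    cp -p "$src" "$dest" && echo "$dest"
}

# After editing:
#   save_plugin_copy /usr/local/nagios/libexec/check_snmp_storage.pl
# After an upgrade, compare the saved copy with the installed plugin:
#   diff /usr/local/nagios/libexec/check_snmp_storage.pl.local-YYYYMMDD \
#        /usr/local/nagios/libexec/check_snmp_storage.pl
```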