No_Response from Remote Host

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Kriyeshh
Posts: 18
Joined: Wed May 13, 2015 5:15 pm
Location: India


Post by Kriyeshh »

Hi All,

What's up, guys? I'm really banging my head against the wall on this one!

I have a centralized Nagios server that monitors my other production servers. Everything works fine except in one scenario.
I keep receiving alerts stating "NO RESPONSE FROM REMOTE HOST". I tried Googling as best I could, but I'm still stuck.
The problem occurs only with one plugin, "check_snmp_storage.pl", and only once or twice a month.

My SNMP version is Net-SNMP 5.5, and I'm using Nagios Core 4.0.8.

###############service definition###################

define service{
    use                     local-service
    host_name               My_Machine
    service_description     D_Mount
    check_command           check_snmp_storage!"^/$"!85!90!
    notifications_enabled   1
    max_check_attempts      5
    check_interval          6
    retry_interval          1
    check_period            24x7
    }


I know there is a real possibility of packet loss with SNMP, because it runs over UDP, which is not reliable like TCP. So I tried extending the timeout value in "check_snmp_storage.pl", but I still received the same error a few days later. I need to understand the puzzle behind this error.
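For context, a UDP request/response client that waits `timeout` seconds per attempt and retransmits `retries` times can block for up to timeout × (retries + 1) seconds before it gives up. The Python sketch below only illustrates that arithmetic; the numbers in it are hypothetical, and the real timeout/retry defaults of check_snmp_storage.pl, Net-SNMP, and your nagios.cfg should be checked rather than assumed.

```python
def worst_case_wait(timeout_s: float, retries: int) -> float:
    """Longest time a UDP request/response client can block: one initial
    attempt plus `retries` retransmissions, each waiting `timeout_s`."""
    if timeout_s <= 0 or retries < 0:
        raise ValueError("timeout must be positive and retries non-negative")
    return timeout_s * (retries + 1)


def fits_in_budget(timeout_s: float, retries: int, nagios_timeout_s: float) -> bool:
    """True if the plugin gives up on its own before Nagios kills the check."""
    return worst_case_wait(timeout_s, retries) < nagios_timeout_s


if __name__ == "__main__":
    # Hypothetical values: 5 s per attempt, 2 retries, 60 s Nagios limit.
    print(worst_case_wait(5, 2))      # 15
    print(fits_in_budget(5, 2, 60))   # True
```

One consequence of this model: a longer timeout only helps when the reply is slow but eventually arrives. If the request or the reply is lost on the wire, only a retransmission recovers it, and if the agent itself is stalled, neither helps until it recovers.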
# What is the cause?
# What should be done?
# How can I avoid these kinds of errors?

So, Nagios players, please help this newbie. :shock:
Suggestions and help are highly appreciated.

-Kriyeshh
Last edited by Kriyeshh on Fri Jul 31, 2015 2:43 pm, edited 1 time in total.
Cheers,
-Kriyeshh
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: No_Response from Remote Host

Post by jdalrymple »

My first guess is that some external circumstance is causing your array to take longer than normal to respond. How long does the check normally take to run? If you run it from localhost (meaning on the remote machine), does it go much faster? If so, you might be able to solve your problem by using check_by_ssh or check_nrpe combined with your SNMP check.
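To put a number on "how long does the check normally take", one option is to wrap the plugin, record its exit code and wall-clock time, and run the wrapper both from the Nagios server and on the remote host. This is a minimal Python sketch; the command line is a placeholder (substitute your real plugin path and arguments), and mapping a timeout to exit code 3 simply follows the Nagios UNKNOWN convention.

```python
from __future__ import annotations

import subprocess
import sys
import time


def time_command(argv: list[str], timeout_s: float = 60.0) -> tuple[int, float]:
    """Run argv and return (exit_code, elapsed_seconds).

    If the command runs longer than timeout_s, subprocess.run kills it and
    we report exit code 3 (UNKNOWN in the Nagios plugin convention).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=timeout_s)
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = 3
    return code, time.monotonic() - start


if __name__ == "__main__":
    # Placeholder command: pass your real plugin and its arguments instead.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    code, elapsed = time_command(cmd)
    print(f"exit={code} elapsed={elapsed:.2f}s")
```

Logging these two numbers every few minutes, for example from cron, would show whether run times creep up around the failure window or jump straight from fast to failed.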
Kriyeshh

Re: No_Response from Remote Host

Post by Kriyeshh »

Yes, Mr. jdalrymple,
The server normally responds with data within seconds.
The SNMP request fails only in the particular scenario I described earlier.
My guess is CPU and memory overload, which delays the response long enough that Nagios gets no answer and reports "NO RESPONSE FROM REMOTE HOST".
Any Ideas?
Cheers,
-Kriyeshh
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: No_Response from Remote Host

Post by tgriep »

Are you running a host check of that system and if so, do you see anything there that could help debugging this?
Are there any backups happening or any high load on the system that could be causing it?
Kriyeshh

Re: No_Response from Remote Host

Post by Kriyeshh »

Hey Tgriep,

Thanks for your reply.
Let me answer your questions one by one.

Are you running a host check of that system and if so, do you see anything there that could help debugging this?

Yes, the host check for that system is enabled, and the host check also fails at that particular time. :-(

Are there any backups happening or any high load on the system that could be causing it?

I suspected this too, so I logged processes and memory usage on both servers (the Nagios server and the monitored server) at the particular time when snmpd fails.
# What I found: the production server's logs confirm that it received the SNMP request from the Nagios server.
# At that particular time, a single process (a backup process) pushes memory and CPU utilization to the maximum (but only for about a minute).
# On the other hand, my Nagios timeout is set to a constant 3+ minutes.
# I also checked for system updates, but nothing like that is happening.

I believe the process that is consuming my memory and CPU is the root cause.
I also checked the kernel log and found SELinux blocking access to some files. Could SELinux be the culprit?

Attaching logs and evidence for your review.

dmesg.log
SELinux: initialized (dev sdf12, type ext3), uses xattr
type=1400 audit(1437886427.155:3): avc: denied { read } for pid=999 comm="pam_console_app" name="fstab" dev=sda2 ino=63637 scontext=system_u:system_r:pam_console_t:s0 tcontext=system_u:object_r:file_t:s0 tclass=file
type=1400 audit(1437886427.155:4): avc: denied { getattr } for pid=999 comm="pam_console_app" path="/etc/fstab" dev=sda2 ino=63637 scontext=system_u:system_r:pam_console_t:s0 tcontext=system_u:object_r:file_t:s0 tclass=file
Adding 19570584k swap on /swapmemory/swapfile. Priority:-1 extents:4931 across:19938584k

syslog
Jul 30 23:30:16 dhclient: DHCPREQUEST on eth0 to 10.0.0.1 port 67
Jul 30 23:30:16 dhclient: DHCPACK from 10.0.0.1
Jul 30 23:30:16 dhclient: bound to 10.0.0.151 -- renewal in 1432 seconds.
Jul 30 23:30:16 setroubleshoot: SELinux is preventing access to files with the label, file_t. For complete SELinux messages. run sealert -l 0f0b393a-1629-4976-9d51-e76cc937163a
Jul 30 23:30:16 setroubleshoot: SELinux is preventing access to files with the label, file_t. For complete SELinux messages. run sealert -l 13e20a79-598c-4c3e-949d-00acf7a46ce0
Jul 30 23:30:24 snmpd[1676]: Connection from UDP: [10.0.0.226]:50424 (snmpd confirmation)
Jul 30 23:30:24 snmpd[1676]: Received SNMP packet(s) from UDP: [10.0.0.226]:50424
Jul 30 23:30:24 snmpd[1676]: Connection from UDP: [10.0.0.226]:44803

Audit.log
type=AVC msg=audit(1438314848.048:6131): avc: denied { read } for pid=31517 comm="hostname" name="config" dev=sda2 ino=62733 scontext=system_u:system_r:hostname_t:s0 tcontext=system_u:object_r:file_t:s0 tclass=file
type=AVC msg=audit(1438314848.048:6132): avc: denied { getattr } for pid=31517 comm="hostname" path="/etc/selinux/config" dev=sda2 ino=62733 scontext=system_u:system_r:hostname_t:s0 tcontext=system_u:object_r:file_t:s0 tclass=file

ps.output
Thu Jul 30 23:00:01 EDT 2015
PID %CPU %MEM STAT COMMAND
3076 0.1 0.1 Ss ora_rsm0_
3012 0.0 4.7 Ss ora_dbw0_
3014 0.0 4.6 Ss ora_dbw1_
3008 0.0 0.1 Ss ora_dia0_
30626 0.0 0.0 Ss sshd: root@pts/0
112 0.0 0.0 S< [kswapd0]
3093 0.0 4.5 Ss ora_pr01_
1676 0.0 0.0 Sl /usr/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.pid -a
3095 0.0 4.5 Ss ora_pr02_
Cheers,
-Kriyeshh
jdalrymple

Re: No_Response from Remote Host

Post by jdalrymple »

I personally don't think your problem is SELinux. To the best of my knowledge, SELinux doesn't switch its behavior back and forth. If the check were failing because of SELinux, I would expect it to fail every time.
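One quick way to test the SELinux theory against the logs posted above is to tally which processes the AVC denials actually name. A minimal Python sketch; the regular expression is an assumption fitted to the quoted lines, not a complete audit-log parser. In the excerpts posted, the denials name pam_console_app and hostname, not snmpd.

```python
import re
from collections import Counter

# Matches "avc: denied { ... }" and captures the comm="..." process name.
AVC_RE = re.compile(r'avc:\s+denied\s+\{[^}]*\}.*?\bcomm="([^"]+)"')


def denials_by_command(lines):
    """Count SELinux AVC denials per process name (comm field)."""
    counts = Counter()
    for line in lines:
        match = AVC_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python avc_tally.py < /var/log/audit/audit.log
    for comm, n in denials_by_command(sys.stdin).most_common():
        print(f"{n:6d}  {comm}")
```

If snmpd never shows up in the tally around the failure times, that supports the view that SELinux is not what is dropping the SNMP requests.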

That said, I'm pretty convinced that your problem is load related. My best advice is to measure the load, and if it is indeed spiking at the time of failure, identify what is causing it. Do the check failures happen on any sort of predictable schedule?
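To identify the heaviest process in a snapshot like the ps.output posted above, the rows can be parsed and ranked. A minimal Python sketch, assuming exactly that column layout (PID %CPU %MEM STAT COMMAND, preceded by a timestamp line); something like `ps -eo pid,pcpu,pmem,stat,args` produces those columns, but how the original snapshot was taken is not stated.

```python
from __future__ import annotations

from typing import NamedTuple


class Proc(NamedTuple):
    pid: int
    cpu: float   # %CPU column
    mem: float   # %MEM column
    stat: str
    command: str


def parse_ps(text: str) -> list[Proc]:
    """Parse PID %CPU %MEM STAT COMMAND rows, skipping the timestamp line,
    the header, and anything else that does not start with a numeric PID."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 4)  # COMMAND may itself contain spaces
        if len(parts) < 5 or not parts[0].isdigit():
            continue
        pid, cpu, mem, stat, command = parts
        procs.append(Proc(int(pid), float(cpu), float(mem), stat, command))
    return procs


def top_by(procs: list[Proc], field: str, n: int = 3) -> list[Proc]:
    """Return the n processes with the highest value of field ('cpu' or 'mem')."""
    return sorted(procs, key=lambda p: getattr(p, field), reverse=True)[:n]
```

Note that the posted snapshot is stamped 23:00:01 while the snmpd syslog lines are from 23:30, so snapshots taken every minute or so across the failure window would say more than a single one.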

My second-best advice is to convert this to a passive check.
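For reference, Nagios accepts passive service results as a PROCESS_SERVICE_CHECK_RESULT line written to its external command file. The Python sketch below builds and writes such a line; the command file path is an assumption (check command_file in your nagios.cfg), and the host/service names in the usage example are taken from the service definition above, with a hypothetical plugin output.

```python
from __future__ import annotations

import time

# Assumed default for a source install; check command_file in nagios.cfg.
DEFAULT_CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def passive_result_line(host: str, service: str, code: int, output: str,
                        timestamp: float | None = None) -> str:
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError("return code must be 0, 1, 2 or 3")
    ts = int(time.time() if timestamp is None else timestamp)
    # Defensive: keep the output on one line and free of field separators.
    clean = output.replace("\n", " ").replace(";", ",")
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{clean}\n"


def submit(line: str, cmd_file: str = DEFAULT_CMD_FILE) -> None:
    """Write the line to the Nagios command file (a named pipe on the server)."""
    with open(cmd_file, "a") as f:
        f.write(line)


if __name__ == "__main__":
    # Hypothetical plugin output for the D_Mount service on My_Machine.
    print(passive_result_line("My_Machine", "D_Mount", 0, "OK: / at 42%"), end="")
```

The command file only exists on the Nagios server, so in this setup the disk check would run on the monitored host and its result would be relayed with something like NSCA. Enabling freshness checking (check_freshness, freshness_threshold) on the passive service would keep a silent host from going unnoticed.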

My last piece of advice is to use max_check_attempts and retry_interval as they were intended: to filter out transient states. That seems especially tolerable for a disk check.
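As a rough model of that filtering: after the first failed check the service goes SOFT, is rechecked every retry_interval, and only turns HARD (and notifies) on attempt max_check_attempts. The Python sketch below computes that window, assuming the default interval_length of 60 seconds and ignoring check run time and scheduling latency.

```python
def minutes_to_hard_state(max_check_attempts: int, retry_interval: float,
                          interval_length_s: int = 60) -> float:
    """Minutes between the first failed check and the HARD state
    (ignores check execution time and scheduling latency)."""
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be >= 1")
    return (max_check_attempts - 1) * retry_interval * interval_length_s / 60


def blip_is_filtered(outage_minutes: float, max_check_attempts: int,
                     retry_interval: float) -> bool:
    """Rough test: an outage shorter than the retry window should clear
    while the service is still SOFT, so no notification goes out."""
    return outage_minutes < minutes_to_hard_state(max_check_attempts, retry_interval)
```

With the posted values (max_check_attempts 5, retry_interval 1) the window is about four minutes, so a one-minute spike should not notify on its own; if alerts still fire, the real outage is probably longer than the spike suggests, which is worth measuring.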
Kriyeshh

Re: No_Response from Remote Host

Post by Kriyeshh »

Hi jdalrymple,
I really do appreciate your suggestions.

# Yes, you are right: if SELinux were the problem, it would block every time, not only at a particular time.
# Load is exactly what I'm analyzing too, but unfortunately I couldn't identify the particular process. Anyway, let me try once more to capture the load at that scheduled time, and I'll come back with the result.
# retry_interval: it's already been fine-tuned, but still no luck :oops:

Let's wait for tomorrow so that I can capture the process... :twisted:
Last edited by Kriyeshh on Mon Aug 03, 2015 2:56 pm, edited 1 time in total.
Cheers,
-Kriyeshh
Kriyeshh

Re: No_Response from Remote Host

Post by Kriyeshh »

Friends,

I forgot to mention one thing here.
Only the SNMP check fails at that particular time; other service checks like SSH, PING, and HTTP work fine.
This suggests the trouble is with the SNMP check we use to monitor the mount points. Could snmpd be getting killed because of the memory load from the scheduled process?
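One way to check whether snmpd is being killed is to read its pid file (the posted ps output shows it started with `-p /var/run/snmpd.pid`) and probe that PID with signal 0, which delivers nothing but reports whether the process exists. A minimal Python sketch for Linux; running it every minute and logging the result would be an assumed monitoring setup, not something from the thread.

```python
import os


def pid_from_file(path: str) -> int:
    """Read the PID that a daemon wrote to its pid file."""
    with open(path) as f:
        return int(f.read().split()[0])


def is_running(pid: int) -> bool:
    """Probe a PID with signal 0: nothing is delivered, but the kernel
    still reports whether the process exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


if __name__ == "__main__":
    pid = pid_from_file("/var/run/snmpd.pid")
    print(pid, "running" if is_running(pid) else "NOT running")
```

A changing PID between samples would also reveal a restart. In the logs posted above, snmpd has PID 1676 in both the 23:00 ps snapshot and the 23:30 syslog lines, which suggests it was not restarted in between; an out-of-memory kill would normally also leave a trace in the kernel log (dmesg).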

Any ideas?
Cheers,
-Kriyeshh
Kriyeshh

Re: No_Response from Remote Host

Post by Kriyeshh »

Friends,

By the way, is there any option to schedule downtime for a service for only 5 minutes, every day all year (24x7x365), or can I disable notifications for just those five minutes?

Just wondering if there is a loophole!!! 8-)
Cheers,
-Kriyeshh
jdalrymple

Re: No_Response from Remote Host

Post by jdalrymple »

Kriyeshh wrote: Friends,

By the way, is there any option to schedule downtime for a service for only 5 minutes, every day all year (24x7x365), or can I disable notifications for just those five minutes?

Just wondering if there is a loophole!!! 8-)
https://assets.nagios.com/downloads/nag ... riods.html

If the outage window is that predictable, you should be able to get around it with a timeperiod, or alternatively continue to adjust max_check_attempts and retry_interval appropriately. Additionally, if it's that predictable, it should be possible to identify the problem by looking at /var/log/cron and/or being on the system when it happens.
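A timeperiod can list several comma-separated ranges per weekday, so "24x7 except a fixed five minutes" can be expressed directly and used as the service's notification_period (or check_period, to skip the checks entirely). The Python sketch below renders such a definition; the timeperiod name and the 23:30-23:35 window are hypothetical, loosely based on the 23:30 syslog timestamps above rather than a measured outage.

```python
WEEKDAYS = ("sunday", "monday", "tuesday", "wednesday",
            "thursday", "friday", "saturday")


def to_minutes(hhmm: str) -> int:
    """Convert "HH:MM" to minutes since midnight ("24:00" -> 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def ranges_excluding(start: str, end: str) -> str:
    """Day ranges covering 00:00-24:00 except the [start, end) window."""
    s, e = to_minutes(start), to_minutes(end)
    if not 0 <= s < e <= 24 * 60:
        raise ValueError("window must satisfy 00:00 <= start < end <= 24:00")
    if s == 0 and e == 24 * 60:
        raise ValueError("window covers the whole day")
    parts = []
    if s > 0:
        parts.append(f"00:00-{start}")
    if e < 24 * 60:
        parts.append(f"{end}-24:00")
    return ",".join(parts)


def timeperiod(name: str, start: str, end: str) -> str:
    """Render a timeperiod object with the same ranges on every weekday."""
    ranges = ranges_excluding(start, end)
    lines = ["define timeperiod{",
             f"    timeperiod_name  {name}",
             f"    alias            24x7 except {start}-{end}"]
    lines += [f"    {day:<16} {ranges}" for day in WEEKDAYS]
    lines.append("    }")
    return "\n".join(lines)


if __name__ == "__main__":
    print(timeperiod("24x7_except_backup", "23:30", "23:35"))
```

Suppressing the window only hides the symptom, so it pairs best with finding the job that runs at that time.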