Hello Team,
My Nagios server is a virtual machine. All of a sudden, "Unable to establish communication with Agent" alerts were triggered for the agents on 100 servers. I tried executing the command manually: the first execution returned results, but the second execution (immediately after the first) returned the error "Unable to establish communication with Agent", and the pattern continued.
The issue persisted for 3 consecutive days; now everything is back to normal.
What could be the cause of the issue?
Is it related to the network?
We have checked the Nagios server load, CPU, etc., and all looks fine. We also checked with the network team, who found no issues either.
Please let us know how we can find the cause of this issue so that we can take preventive action.
Thank you,
Padma Muthu
Nagios windows agent issue
Re: Nagios windows agent issue
I am going to assume the agent you are using is NSClient++. Please correct me if I am wrong.
Which plugin is being used on the Nagios Core side of things to reach out to NSClient++? What version of that plugin are you using?
Which version of NSClient++ is being used on your machines? Do you have a standard NSClient++ configuration these machines use and, if so, could you share it?
padu_3891 wrote: We have check the nagios server load, CPU, etc, all looks fine.
Did you also check the Nagios Core machine's available file descriptors, open file limits, and available sockets?
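As a rough illustration of that check, here is a minimal Python sketch, assuming a Linux host with /proc mounted. The function names fd_report and socket_summary are made up for this example. It reports the limits of the process that runs it, so it would need to be run as the same user that runs Nagios Core to be meaningful.

```python
import os
import resource


def fd_report():
    """Return the open-file limits and descriptor count of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd has one entry per open descriptor (Linux only).
    # The count includes the descriptor used to list the directory itself.
    open_fds = len(os.listdir("/proc/self/fd"))
    return {"soft_limit": soft, "hard_limit": hard, "open_fds": open_fds}


def socket_summary(path="/proc/net/sockstat"):
    """Parse the kernel socket counters into {protocol: {counter: value}}."""
    summary = {}
    with open(path) as handle:
        for line in handle:
            proto, _, rest = line.partition(":")
            fields = rest.split()
            # Fields alternate name/value, e.g. "inuse 12 orphan 0 tw 340".
            summary[proto] = {
                name: int(value)
                for name, value in zip(fields[::2], fields[1::2])
            }
    return summary
```

Calling fd_report() and socket_summary() while the failures are happening, and again when things are healthy, gives two snapshots to compare. A large "tw" (TIME_WAIT) count under TCP or an open_fds value close to soft_limit would be worth a closer look.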
Former Nagios employee
https://www.mcapra.com/
Re: Nagios windows agent issue
mcapra wrote: I am going to assume the agent you are using is NSClient++.
You are correct.
mcapra wrote: Which plugin is being used on the Nagios Core side of things to reach out to NSClient++? What version of that plugin are you using?
check_nrpe, version 2.12.
mcapra wrote: Which version of NSClient++ is being used on your machines? Do you have a standard NSClient++ configuration these machines use and, if so, could you share it?
NSClient++ version 4.3.1. Yes, it is a standard configuration. Do you want me to share the nsclient.ini file?
mcapra wrote: Did you also check the Nagios Core machine's available file descriptors, open file limits, and available sockets?
Yes, everything is fine; no issues found.
Re: Nagios windows agent issue
Can you share the full historical nagios log that contains these ~100 or so failures? Typically the historical logs can be found here:
Code: Select all
/usr/local/nagios/var/archives
I'd like to see the full log from a given day if possible, not just a handful of entries demonstrating the error message.
Which OS and version of that OS is this machine using? Which hypervisor is hosting the VM?
Also, if you happen to have a copy of your system's primary log file (/var/log/messages on CentOS/RHEL) from that same time period, that may be useful.
I'm fairly confident this is some sort of system/network related issue rather than a failure of NSClient++ or the check_nrpe plugin specifically (I could be wrong). I've seen setups executing ~100 or so simultaneous check_nrpe calls to various agents (mostly NSClient++) without totally tanking. Besides that, given how check_nrpe functions, I don't think it would make sense for a few hundred agents to simultaneously stop responding unless there was some sort of network/system issue that prevented check_nrpe from correctly establishing a connection.
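One way to separate a connection problem from an agent problem is to test the bare TCP connection from the Nagios machine, without check_nrpe involved. The sketch below is only an illustration: the function name probe is invented here, and it assumes the agents listen on the default NRPE port 5666. It completes only the TCP handshake and does not speak the NRPE protocol, so a success proves reachability and nothing more.

```python
import socket
import time


def probe(host, port=5666, attempts=5, timeout=3.0, pause=1.0):
    """Open and close a TCP connection repeatedly.

    Returns one (ok, seconds, error) tuple per attempt.
    """
    results = []
    for attempt in range(attempts):
        started = time.monotonic()
        try:
            # Only the TCP handshake is tested, then the socket is closed.
            with socket.create_connection((host, port), timeout=timeout):
                outcome = (True, time.monotonic() - started, None)
        except OSError as exc:
            # Covers refused connections, timeouts and DNS failures.
            outcome = (False, time.monotonic() - started, str(exc))
        results.append(outcome)
        if attempt < attempts - 1:
            time.sleep(pause)
    return results
```

Running probe("winhost01", pause=0) against an affected machine (the hostname is a placeholder) mimics the back-to-back executions described in the first post. If the second attempt fails here as well, the problem sits below check_nrpe and NSClient++.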
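To see when the failures clustered before sending the archived logs, a small script can bucket the matching alerts by hour. This is a sketch under assumptions: it expects the usual Nagios Core log layout of "[epoch] SERVICE ALERT: host;service;state;type;attempt;output", the function name failures_per_hour is made up, and the default search text is simply the error quoted at the top of this thread.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches "[epoch] SERVICE ALERT: host;rest" and "[epoch] HOST ALERT: host;rest".
ALERT = re.compile(r"^\[(\d+)\] (?:SERVICE|HOST) ALERT: ([^;]+);(.*)$")


def failures_per_hour(lines, needle="Unable to establish communication"):
    """Count alert lines containing `needle`, bucketed by UTC hour.

    Returns (Counter of "YYYY-MM-DD HH:00" -> count, set of host names).
    """
    buckets = Counter()
    hosts = set()
    for line in lines:
        match = ALERT.match(line)
        if not match or needle not in match.group(3):
            continue
        stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        buckets[stamp.strftime("%Y-%m-%d %H:00")] += 1
        hosts.add(match.group(2))
    return buckets, hosts
```

Passing an open file object for one of the archive files works, since the function only iterates over lines. A sudden jump that hits nearly every host in the same hour points at something shared, such as the Nagios machine or the network path, rather than at individual agents.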
Re: Nagios windows agent issue
@mcapra Thanks a lot for your suggestion. As you said, I found both issues. CPU utilisation on my server was high, which may be one cause.
I am going to increase the server's resources for now and will let you know if I face more issues.
Just one more query.
Will running the Nagios server in a VMware environment cause any issues? Which would you suggest, a standalone machine or a virtual machine?
Re: Nagios windows agent issue
Core can run equally well in a VM or on physical hardware. The differences in performance are minor for the most part, and really don't show themselves until an environment becomes quite large.