Experiencing timeouts on multiple servers - Nagios 3.x

Posted: Wed Apr 05, 2017 12:16 pm
by jonescl2
Hello...I'm trying to troubleshoot an issue we are experiencing across a few of our Nagios servers. We are seeing a high volume of service check and plugin timeouts on a couple dozen servers. Most of these servers are Windows Server 2012 R2. The technical teams have apparently eliminated any network issues and so far have not found anything on the servers themselves that would be causing this.
There are servers alongside the ones we are having issues with that reside in the same network and have the same settings but are functioning normally.
This issue just cropped up last week after some patching and OS upgrades. Patches have been rolled back and still no change.
Any information you may have would be appreciated. We also ran the same wmi queries from the command line (outside of groundwork), using the same credentials, and still see the issue. Most times we can get one response back normally but then subsequent attempts time out.
These servers were fine in Nagios up until a week ago. No changes have been made to Nagios.
Thank you

Re: Experiencing timeouts on multiple servers - Nagios 3.x

Posted: Wed Apr 05, 2017 2:33 pm
by mcapra
jonescl2 wrote: We also ran the same wmi queries from the command line (outside of groundwork), using the same credentials, and still see the issue. Most times we can get one response back normally but then subsequent attempts time out.
WMI queries are notoriously slow.

Can you share a bit more information about how you are monitoring these Windows machines:
  • Are you using an agent? If so, which agent and which version?
  • Which plugin are you using to execute your Nagios checks? check_nrpe, check_nt, check_wmi_plus.pl, etc
  • Can you share some sample host/service definitions of your Windows hosts?

Re: Experiencing timeouts on multiple servers - Nagios 3.x

Posted: Wed Apr 12, 2017 7:57 am
by jonescl2
Hi...thanks for your reply. I'll share what I can.

Are you using an agent? If so, which agent and which version?
- Agentless. 95% of our environment is agentless. We are seeing the issue on maybe 1% of the servers so far.

Which plugin are you using to execute your Nagios checks? check_nrpe, check_nt, check_wmi_plus.pl, etc
- check_wmi_plus_domain.pl

Can you share some sample host/service definitions of your Windows hosts?
- here are the specs for our CPU check. this is consistent across all of our windows servers. let me know if you need anything else.
service name: win_cpu_wmi_do
check command: check_wmi_win_cpu_domain
command definition: $USER1$/check_wmi_plus_domain.pl -H $HOSTADDRESS$ -m checkcpu -D $HOSTALIAS$ -w $ARG1$ -c $ARG2$
usage: check_wmi_win_cpu_domain!ARG1!ARG2
command line: check_wmi_win_cpu_domain!90!95

Re: Experiencing timeouts on multiple servers - Nagios 3.x

Posted: Wed Apr 12, 2017 12:27 pm
by tacolover101
We also ran the same wmi queries from the command line (outside of groundwork)
you should probably contact groundwork for support as it could vary slightly how it's operating.

if it's still happening on the CLI though as you mention, then i generally think it's a windows problem. might be worth diving in with the wmic command to see if you can troubleshoot further.

Re: Experiencing timeouts on multiple servers - Nagios 3.x

Posted: Wed Apr 12, 2017 1:19 pm
by jonescl2
tacolover101 wrote:
We also ran the same wmi queries from the command line (outside of groundwork)
you should probably contact groundwork for support as it could vary slightly how it's operating.

if it's still happening on the CLI though as you mention, then i generally think it's a windows problem. might be worth diving in with the wmic command to see if you can troubleshoot further.
Thank you...we've explored and tested WMI on the servers and everything checks out ok. The tech team was working with Microsoft on the Windows side.
We are leaning toward a network issue. Authentication and RPC ping work every time, but DCOM/RPC dynamic port connections stumble.

Thanks

Re: Experiencing timeouts on multiple servers - Nagios 3.x

Posted: Wed Apr 12, 2017 4:29 pm
by tgriep
If you are experiencing sporadic timeouts when using the plugin, you can increase the plugin's timeout with the -t option.
If you add

-t 60
to the command definition, that will increase the timeout to 60 seconds and may fix that issue.
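
For example, applied to the CPU check command posted earlier in this thread, the change might look roughly like the sketch below. It assumes that check_wmi_plus_domain.pl accepts -t the same way check_wmi_plus.pl does, and that the command is defined under the name check_wmi_win_cpu_domain as shown in the earlier post; adjust it to match your actual configuration.

```
# Sketch only: the original command definition with "-t 60" appended.
# Assumes the check_wmi_plus_domain.pl wrapper honors the -t timeout option.
define command {
    command_name    check_wmi_win_cpu_domain
    command_line    $USER1$/check_wmi_plus_domain.pl -H $HOSTADDRESS$ -m checkcpu -D $HOSTALIAS$ -w $ARG1$ -c $ARG2$ -t 60
}
```

Keep in mind that Nagios also enforces its own service_check_timeout in nagios.cfg (60 seconds by default), so a plugin timeout at or above that value may not help unless that setting is raised as well.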