Receiving hundreds of "CHECK_NRPE: Socket timeout after 60 seconds" at the same time

mhawkins
Posts: 2
Joined: Wed May 18, 2011 8:01 am

Post by mhawkins »

Receiving hundreds of "CHECK_NRPE: Socket timeout after 60 seconds" at the same time

We are randomly getting hundreds of "CHECK_NRPE: Socket timeout after 60 seconds" errors, all at the same or nearly the same time. These are checks that usually run in a couple of seconds, and we are almost 100% sure that the actual checks are not running longer than 60 seconds. It happens at random times each day. There is almost no resource contention on the Nagios master server, and there is no memory or CPU spike on the server. We have our open file limit set to 500,000 and it never seems to hit that limit. We have about 45,000 services defined on the Nagios master.

Has anyone experienced this or know how to fix this?
Here is an example of one of the hundreds of socket timeout errors from nagios.log:

[1481816105] SERVICE NOTIFICATION: netcool;xxxx;Web_Performance_Monitor_;CRITICAL;send_snmptrap;CHECK_NRPE: Socket timeout after 60 seconds.
[1481816105] wproc: Core Worker 96207: job 22453664 with pid 102644 reaped at timeout. timeouts=1; started=37350
[1481816105] wproc: Core Worker 96221: job 37138 (pid=102996) timed out. Killing it
[1481816105] wproc: CHECK job 37138 from worker Core Worker 96221 timed out after 60.00s
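
To see whether the timeouts really arrive in bursts, the log can be bucketed by minute. This is only a minimal sketch: it assumes every log line starts with an epoch timestamp in square brackets, as in the excerpt above, and the function name `count_timeouts_per_minute` is made up for illustration.

```shell
# Sketch: count "Socket timeout" log entries per minute.
# Assumes each line begins with "[<epoch seconds>]" as in the excerpt above;
# the function name is hypothetical.
count_timeouts_per_minute() {
    # $1: path to a nagios.log file
    grep -F 'CHECK_NRPE: Socket timeout' "$1" |
        awk -F'[][]' '{ n[sprintf("%d", $2 - ($2 % 60))]++ }
                      END { for (b in n) print b, n[b] }' |
        sort -n
}
```

Each output line is the start of a minute (epoch seconds) followed by the number of timeouts logged in that minute, so a burst shows up as one or two lines with very large counts.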
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Post by tmcdonald »

  • What is your Core version?
  • Are you using any modules like mod_gearman or livestatus?
  • When this happens, how many Critical and Warning services or Down hosts do you have?
Former Nagios employee
mhawkins
Posts: 2
Joined: Wed May 18, 2011 8:01 am

Post by mhawkins »

I am using Core version 4.2.1. I am not using mod_gearman or livestatus. Currently we have about 13 down hosts. There are 82 critical and 52 warning services.
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Post by tgriep »

Could you post your nagios.cfg file so we can view it?
Also, log in to the server as root, run the following command, and post the output here as well.

Code: Select all

ps -ef --cols=300
Thanks
Be sure to check out our Knowledgebase for helpful articles and solutions!
mrussi
Posts: 6
Joined: Thu Sep 07, 2017 2:24 pm

Post by mrussi »

Sorry for the thread necromancy, but I felt that it was relevant as I'm a colleague of mhawkins, and we're still experiencing this issue. I've attached the requested nagios.cfg and the output from ps.

Once again, no signs of issues around performance/resource usage of CPU/DISK IO/MEM/NETWORK IO/etc. We did upgrade to 4.3.1 earlier in the year, but we still see the constant false alarms around "CHECK_NRPE: Socket timeout of 60s" as well as "NRPE: Unable to read output" statuses.

The interesting thing we've noted about the "NRPE: Unable to read output" statuses is that they're coming in as CRITICAL rather than the usual UNKNOWN. These come in almost every couple of minutes now.

This issue has also been isolated to one datacenter, ALPHA. As an example, another datacenter, BRAVO, with the exact same physical server and nagios.cfg does not exhibit this issue. However, there are about 10k fewer host/service checks in BRAVO. There are no network connectivity issues according to our Network team. It seems to be related to the application itself, as we swapped in another server running the exact same configuration and it has the same issues.

OS Version: CentOS 6.6
Physical Memory: 125.93GB
CPUs: 40
DISKs: Striped SSDs

Datacenter | Active Host Checks | Active Service Checks
ALPHA | 3839 | 65937
BRAVO | 3146 | 54291
Attachments
nagios.cfg
(45.22 KiB) Downloaded 352 times
ps.txt
(49.03 KiB) Downloaded 366 times
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Post by tgriep »

The nagios.cfg file looks fairly standard and the settings should work.

In the ps output, it looks like the default port for the NRPE Agent has been changed but I don't think that would cause intermittent timeouts unless something is randomly blocking that port.

Do you see anything happen in the nagios.log files or the /var/log/messages file on the Nagios server before the timeouts happen?

Can you go to the Performance Info menu in the Nagios GUI, screen capture that and post it here?


About the timeouts returning a critical and not an unknown like before, there is a setting in the nagios.cfg file that controls that status. See below.

Code: Select all

# SERVICE CHECK TIMEOUT STATE
# This setting determines the state Nagios will report when a
# service check times out - that is does not respond within
# service_check_timeout seconds.  This can be useful if a
# machine is running at too high a load and you do not want
# to consider a failed service check to be critical (the default).
# Valid settings are:
# c - Critical (default)
# u - Unknown
# w - Warning
# o - OK

service_check_timeout_state=c

If you want to change the status, just change it to u and that will set the default timeout status back to unknown.
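
To confirm which value is actually in effect before changing it, something like the following could be used. It is a sketch only: the function name `get_timeout_state` is hypothetical, and it assumes that if the directive appears more than once, the last uncommented occurrence is the one that counts.

```shell
# Sketch: print the service_check_timeout_state value from a nagios.cfg.
# The function name is hypothetical. If the directive appears more than once,
# the last uncommented occurrence is printed (an assumption about precedence).
get_timeout_state() {
    # $1: path to nagios.cfg
    awk -F= '/^[ \t]*service_check_timeout_state[ \t]*=/ {
                 v = $2; gsub(/[ \t\r]/, "", v)
             }
             END { if (v != "") print v }' "$1"
}
```

If nothing is printed, the directive is not set in that file, and per the comment block above the default of critical would apply.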
mrussi
Posts: 6
Joined: Thu Sep 07, 2017 2:24 pm

Post by mrussi »

tgriep wrote: Do you see anything happen in the nagios.log files or the /var/log/messages file on the Nagios server before the timeouts happen?
Can you go to the Performance Info menu in the Nagios GUI, screen capture that and post it here?
About the timeouts returning a critical and not an unknown like before, there is a setting in the nagios.cfg file that controls that status. See Below.
If you want to change the status, just change it to u and that will set the default timeout status back to unknown.
The only Nagios related entries we see in /var/log/messages are the following:

Code: Select all

Sep 14 17:18:29 nagios10w100m3 nagios: job 191935 (pid=104675): read() returned error 11
Sep 14 17:18:30 nagios10w100m3 nagios: job 191970 (pid=106347): read() returned error 11
Sep 14 17:18:30 nagios10w100m3 nagios: job 191972 (pid=106436): read() returned error 11
Sep 14 17:18:30 nagios10w100m3 nagios: job 191978 (pid=106741): read() returned error 11
Sep 14 17:18:31 nagios10w100m3 nagios: job 192013 (pid=108419): read() returned error 11
Sep 14 17:18:31 nagios10w100m3 nagios: job 192020 (pid=108766): read() returned error 11
We've set the logs to debug in the past, but nothing stood out. The debug log also fills up quickly, even with the log file size set to 1GB, which makes it tough to pinpoint issues. Are there specific metrics we should look for in the logs?

I've attached a screenshot from the Performance Info page.

Does "service_check_timeout_state" also apply to "NRPE: Unable to read output" returning as Critical? I wasn't sure if that also applied since it doesn't say "Service check timed out after 60.00 seconds" in the output.
Attachments
ALPHA_PerformanceInfo.png
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Post by tgriep »

The error 11 messages are a known issue in Core. Take a look at this link for more details.
https://github.com/NagiosEnterprises/na ... issues/172

The "NRPE: Unable to read output" message is caused by many things and this KB article describes what they could be.
https://support.nagios.com/kb/article/n ... utput.html

It is hard to determine what is causing the intermittent timeouts, but your system is running a lot of checks every 5 minutes and it may not be able to keep up.
You may want to increase the number of workers in the nagios.cfg file to 80.

Code: Select all

check_workers=80
The system normally allocates the workers itself, but there may not be enough of them, so this setting will spawn more workers than are currently running on your system.
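
For a rough sense of scale, the check rate can be estimated from the numbers posted earlier in the thread. This sketch assumes all 65937 service checks in ALPHA run every 5 minutes and are spread evenly, and it uses 2 seconds as an assumed average execution time (the "couple of seconds" mentioned in the first post). The function names are hypothetical, and the result says nothing about how Core distributes jobs across workers.

```shell
# Sketch: rough steady-state load estimate, assuming evenly spread checks.
checks_per_second() {
    # $1: number of active checks, $2: check interval in seconds
    awk -v n="$1" -v i="$2" 'BEGIN { printf "%.1f\n", n / i }'
}

checks_in_flight() {
    # $1: checks, $2: interval in seconds, $3: assumed average run time in seconds
    awk -v n="$1" -v i="$2" -v d="$3" 'BEGIN { printf "%.1f\n", n * d / i }'
}

checks_per_second 65937 300    # prints 219.8
checks_in_flight 65937 300 2   # prints 439.6
```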

You may want to try some of the tips from this link.
https://assets.nagios.com/downloads/nag ... uning.html
mrussi
Posts: 6
Joined: Thu Sep 07, 2017 2:24 pm

Post by mrussi »

I've increased the number of workers by setting check_workers=80. Unfortunately, we still see large numbers of timeouts and "NRPE: Unable to read output" statuses every few minutes. I've had it enabled for about 6 hours now.

Regarding the Tuning documentation, here are my comments:
  • #1 - We graph service check latencies already through PNP4Nagios, but we can look at adding these other metrics as well.
  • #2 - We already use large installation tweaks
  • #3 - We have environment macros disabled by default. enable_environment_macros=0
  • #4/#5 - These don't apply to Nagios Core 4 as they don't make any impact on active checks (which we solely use).
  • #6 - We currently have it set to unlimited through max_concurrent_checks=0. Note that the link in the sentence "More information on service check scheduling can be found here." leads to the same notice that was in the v3 docs: "This documentation is being rewritten..."
  • #7 - Unfortunately, passive checks do not apply to our use case.
  • #8 - 99% of our checks run from the remote hosts, and they are mainly Perl/KSH scripts. We do use a few of the compiled nagios-plugins. However, execution time has historically been very low. We aim for ~5 seconds max execution time.
  • #9 - We currently use this method of 5 ICMP packets. We can look at changing this to the recommended setup, but I don't foresee much change.

    Code: Select all

    command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
  • #10 - For hosts, we have check_interval set to 10.

    Code: Select all

    check_interval 10
  • #11 - We currently have cached_host_check_horizon=15 set. Most of our checks run every 5 minutes, with a few select ones running every minute. Would changing this to 60 seconds cause significant impact?
  • #12 - We have aggressive host checking off by default. use_aggressive_host_checking=0
  • #13 - Not sure if this would help any at all as we see no resource contention on the Nagios host that mainly sits idle.
The issue we're experiencing with the "NRPE: Unable to read output" message is that it's outputting as CRITICAL when run by the Nagios server. However, we cannot replicate it by running the command through check_nrpe ourselves. The checks work fine without ever showing "Unable to read output".

In our experience, when that output is returned as UNKNOWN, it's a consistently reproducible issue on the remote host. In these cases, we cannot ever reproduce it on our own.

For example, Nagios alerted on this.

Code: Select all

[1505501303] SERVICE ALERT: ${HOSTNAME};amq_logs;CRITICAL;HARD;1;NRPE: Unable to read output
Roughly 10 seconds later, I ran the same command against the same host:

Code: Select all

$ time /prd-nagios-app-core/nagios/libexec/check_nrpe -t 60 -H ${HOSTNAME} -p 5659 -c check_logfiles -a ${LOGFILE PATH} amq_logs ; echo $?
OK - no errors or warnings

real	0m0.70s
user	0m0.00s
sys	0m0.00s
0
Could there be some OS-level setting that we need to tweak? I don't see the ephemeral port range being exhausted. Have you seen other users with large installations like ours need some tweaking on the OS?

Code: Select all

$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 1024	65535
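
One way to put that range in context is to compare its size with the number of sockets in each TCP state on the Nagios server. The following is a sketch; the function name `port_range_size` is hypothetical and it only parses the sysctl output format shown above.

```shell
# Sketch: size of the ephemeral port range, parsed from sysctl output.
port_range_size() {
    # stdin: a line such as "net.ipv4.ip_local_port_range = 1024 65535"
    awk -F= '{ split($2, r, " "); print r[2] - r[1] + 1 }'
}

# Usage on the Nagios server (not executed here):
#   sysctl net.ipv4.ip_local_port_range | port_range_size
#   ss -tan | awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }'
```

For the range shown above this gives 64512 usable ports. The second command counts sockets per TCP state, so a TIME-WAIT count approaching that figure during a burst would point toward port exhaustion.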
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Post by tgriep »

The "NRPE: Unable to read output" message could be caused by a special character in the argument list.
Try enclosing them in single quotes and see if that resolves those errors.

Code: Select all

-a '${LOGFILE PATH} amq_logs'
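
As a general illustration of why quoting matters here, the sketch below shows how the shell changes the number of arguments a command receives. The path is hypothetical (it stands in for the redacted one and deliberately contains a space), and `count_args` is a made-up helper. Note that single quotes would also stop the shell from expanding a real variable.

```shell
# Sketch: how quoting changes the number of arguments a command receives.
count_args() { echo "$#"; }

# Hypothetical path containing a space, standing in for the redacted one.
logfile_path='/var/log/amq logs/app.log'

count_args $logfile_path amq_logs        # unquoted: prints 3 (path is split)
count_args "$logfile_path" amq_logs      # quoted variable: prints 2
count_args "$logfile_path amq_logs"      # one quoted string: prints 1
```

Whichever form is used, the number of arguments reaching check_nrpe after -a has to match what the command definition on the remote host expects.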
Locked