NSCA 2.9 client problem

lhozzan
Posts: 21
Joined: Wed Mar 20, 2019 10:43 am

Re: NSCA 2.9 client problem

Post by lhozzan »

Hello.

Today I tested this solution and it looks usable: the NSCA server v2.9 runs via xinetd, with clients v2.7 plus one test client v2.9. Screenshot attached.

However, the xinetd config file needs a few changes:

Code: Select all

service nsca
{
  flags           = NORETRY IPv4
  # the REUSE flag is deprecated; all services have it implicitly, see https://man.cx/xinetd.conf(5)
  socket_type     = stream
  wait            = no
  user            = nagios
  group           = nagios
  server          = /usr/sbin/nsca
  server_args     = -c /etc/nagios/nsca.cfg --inetd
  type            = UNLISTED
  port            = 5662

  log_on_success  =
  # an empty value suppresses the noisy START/EXIT notifications from threads in syslog

  # log_on_success  += HOST EXIT DURATION
  # uncomment for debugging connections

  log_on_failure  += HOST ATTEMPT
  disable         = no

  per_source      = 50
  # - could be set to UNLIMITED
  # - raise this if many "per_source_limit" notifications appear in syslog

  instances       = 100
  # - could be set to UNLIMITED
  # - raise this if many "service_limit" notifications appear in syslog

  cps             = 300 10
  # - raise the first value if many "connections per second" notifications appear in syslog
}
The changes are commented directly in the config above. I hope this helps other admins solve this or similar problems.

A few closing notes (the syslog check sketched right after this list may help when tuning the limits):
- the per_source value has no effect on the CLOSE_WAIT connections, but it must be raised anyway, otherwise xinetd starts rejecting connections
- using the UNLIMITED value is not a good idea in a production environment
- when the NSCA server runs via xinetd, there is no difference between client versions (a bug on the NSCA client v2.9 side?)
- xinetd causes a higher CPU load than running the NSCA server as a standalone service via systemd (see the "xinet CPU impact" attachment)
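
If you are unsure whether the limits above are high enough, a rough syslog check like this may help (a sketch only; it assumes CentOS 7 logging to /var/log/messages, and the exact message strings can vary between xinetd versions):

Code: Select all

# how often xinetd has hit each limit since the last log rotation
grep -c 'per_source_limit' /var/log/messages
grep -c 'service_limit' /var/log/messages
grep -ci 'connections per second' /var/log/messages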

After discussing it with my colleagues I will attempt to deploy this to production. Again, many thanks for the support; it seems this approach is the solution for us.
Attachments
xinet CPU impact
Default + testing state
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: NSCA 2.9 client problem

Post by cdienger »

Glad to hear! Thanks for sharing your findings!
lhozzan
Posts: 21
Joined: Wed Mar 20, 2019 10:43 am

Re: NSCA 2.9 client problem

Post by lhozzan »

Hello.

Today I switched the NSCA server from systemd to xinetd and other problems occurred:

Code: Select all

Mar 29 12:47:31 localhost xinetd[3812]: Exiting...
Mar 29 12:47:32 localhost systemd: Stopped Xinetd A Powerful Replacement For Inetd.
Mar 29 12:47:32 localhost systemd: Starting Xinetd A Powerful Replacement For Inetd...
Mar 29 12:47:32 localhost systemd: Started Xinetd A Powerful Replacement For Inetd.
Mar 29 12:47:32 localhost xinetd[578]: xinetd Version 2.3.15 started with libwrap loadavg labeled-networking options compiled in.
Mar 29 12:47:32 localhost xinetd[578]: Started working: 8 available services
Mar 29 12:47:36 localhost kernel: TCP: request_sock_TCP: Possible SYN flooding on port 5666. Sending cookies.  Check SNMP counters.
Mar 29 12:47:41 localhost kernel: TCP: request_sock_TCP: Possible SYN flooding on port 5660. Sending cookies.  Check SNMP counters.
Mar 29 12:48:13 localhost kernel: TCP: request_sock_TCP: Possible SYN flooding on port 5664. Sending cookies.  Check SNMP counters.
Mar 29 12:48:13 localhost kernel: TCP: request_sock_TCP: Possible SYN flooding on port 5663. Sending cookies.  Check SNMP counters.
Mar 29 12:48:13 localhost kernel: TCP: request_sock_TCP: Possible SYN flooding on port 5662. Sending cookies.  Check SNMP counters.
Mar 29 12:49:33 localhost nagios: job 19576 (pid=7206): read() returned error 11
We have a large cloud installation spread across the whole world, and each NSCA port (thread) serves a different geo-cluster. Each geo-cluster has approximately 50-100 servers (KVM guests, not bare metal machines).

We also spotted this SYN flooding problem with the SysV setup (CentOS 6, NSCA server 2.7) once we had more than roughly 200 servers per NSCA TCP port. That was the main reason we had to split our monitoring into geo-clusters. It seems the xinetd setup hits this limit much sooner (~25 servers per TCP port) and with higher CPU load.
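
When the kernel says "Check SNMP counters", it is pointing at the listen-queue statistics. A quick way to see whether the backlog is actually overflowing (a sketch using standard CentOS 7 tools):

Code: Select all

# counters for SYNs dropped and listen-queue overflows
netstat -s | grep -i -E 'listen|SYN'
# listening sockets on the NSCA ports with their current backlog (Recv-Q column)
ss -ltn | grep -E ':566[0-9] '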

The CPU load rose as well, in my opinion to an unacceptable degree (+ ~50%!).

I had to revert this solution :( .

I would like to ask for any idea of what to check to get this working. My goal is a monitoring server on CentOS 7 (which it currently is), with the NSCA server v2.9 (which it currently is too) and NSCA clients v2.9 (where the current problem with sending notifications lies).

Thank you very much for your effort.
Attachments
CPU load
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: NSCA 2.9 client problem

Post by cdienger »

Is this on an XI machine or just a Core machine? What versions? I'd like to do some more testing and want to make sure the systems are as similar as possible.

If you're able to go back to xinetd - I'd try to increase the socket backlog and currently-handshaking limits following https://access.redhat.com/solutions/30453#accepting.
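
The socket backlog and currently-handshaking limits map to the net.core.somaxconn and net.ipv4.tcp_max_syn_backlog sysctls; a sketch of raising them (the values below are placeholders, not recommendations):

Code: Select all

# runtime change - placeholder values
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
# persist across reboots
cat >> /etc/sysctl.d/90-nsca.conf <<'EOF'
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
EOF
sysctl --system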
lhozzan
Posts: 21
Joined: Wed Mar 20, 2019 10:43 am

Re: NSCA 2.9 client problem

Post by lhozzan »

Hello.
Is this on an XI machine or just a Core machine? What versions? I'd like to do some more testing and want to make sure the systems are as similar as possible.
Nagios is the default version included in the EPEL repository on CentOS 7 (Version: 4.4.3, Revision: 1.el7, Repository: epel), and so is NSCA (Version: 2.9.2, Revision: 1.el7, Repository: epel).
If you're able to go back to xinetd - I'd try to increase the socket backlog and currently-handshaking limits following ...
Yes, I tried that too, but after several hours none of the changes had any effect.

Code: Select all

net.core.somaxconn = 120000
net.ipv4.tcp_max_syn_backlog = 6000000
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 60
net.core.rmem_max=24000000
net.core.wmem_max=212992
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_slow_start_after_idle=0
net.netfilter.nf_conntrack_max=2000000
net.core.netdev_max_backlog=128000
... and these values are nearly double what we use on application load balancers handling thousands of requests per second. Crazy values, I think :).

In the xinetd config I had to use these values:

Code: Select all

per_source      = 400
instances       = 800
cps             = 550 10
... to run successfully.

You can see the network status in the screenshot (testing from ~11:00 to ~14:30, on the most loaded geo-cluster). There are still ~100 connections in the TIME_WAIT state, but I think this may be the default behaviour for xinetd.

This server instance has 4 cores allocated (Xeon CPU on bare metal) and 8 GB RAM. No other software runs here; it only provides the monitoring services via Nagios.

Every server sends ~5-10 checks per 10 seconds.
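
As a rough sanity check (assuming each passive check is a separate send_nsca connection): 50-100 servers sending ~0.5-1 checks per second each works out to roughly 25-100 new connections per second on a single NSCA port, so the cps = 550 10 setting above has headroom in steady state but much less during bursts or reconnect storms.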

If you need any additional information, please let me know. As I wrote, I need to solve this problem, and if you have a theory to test I am glad to try it. We manage this large infrastructure with SaltStack, so I wrote states for the deployment; switching between configurations is quick and easy, but developing them costs some time.

Many thanks for your effort.
Attachments
Testing - connection view
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: NSCA 2.9 client problem

Post by cdienger »

I installed the same packages and didn't have any issues until I did a system upgrade with a "yum upgrade". Now when nsca connections come in frequently, I can see connections switching to CLOSE_WAIT frequently. They don't seem to stick around though - if you run "netstat -nap | grep 5667" and monitor this, do you see a lot of connections getting into this state at a time? Run this multiple times and see if you can get an idea of how long the connections are sticking around in the CLOSE_WAIT state.
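
A small loop along these lines makes the lifetime of those connections easier to see (a sketch; port 5667 is only the default from the command above, substitute the NSCA port actually in use, e.g. 5662):

Code: Select all

# count CLOSE_WAIT connections on the NSCA port every 30 seconds
while true; do
    date
    netstat -nap | grep ':5667' | grep -c CLOSE_WAIT
    sleep 30
done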
lhozzan
Posts: 21
Joined: Wed Mar 20, 2019 10:43 am

Re: NSCA 2.9 client problem

Post by lhozzan »

Hello.

I ran the tests you asked for, and it seems the problem may be on the NSCA client v2.9 side (the NSCA server v2.9 runs via systemd).

When I upgraded the NSCA client on one server (v2.7 -> v2.9), it looked like it was working correctly. After a while I noticed that the same connection from this client stays in the CLOSE_WAIT state (identical source port 38928).

Code: Select all

[operator@server ~]$ date; netstat -nap | grep 5662 | grep "178.79.165.168"
Ut apr  2 07:00:51 UTC 2019
tcp        1      0 45.33.80.18:5662        178.79.165.168:38928    CLOSE_WAIT  4879/nsca_uk 
... after about 10 minutes

Code: Select all

[operator@server ~]$ date; netstat -nap | grep 5662 | grep "178.79.165.168"
Ut apr  2 07:10:26 UTC 2019
tcp        1      0 45.33.80.18:5662        178.79.165.168:40776    CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        178.79.165.168:38928    CLOSE_WAIT  4879/nsca_uk
... after another roughly 10 minutes

Code: Select all

[operator@server ~]$ date; netstat -nap | grep 5662 | grep "178.79.165.168"
Ut apr  2 07:22:08 UTC 2019
tcp        1      0 45.33.80.18:5662        178.79.165.168:40776    CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        178.79.165.168:38928    CLOSE_WAIT  4879/nsca_uk
I also noticed that these CLOSE_WAIT connections are generated by the NSCA client v2.7 too, but at a higher rate. It seems they are sometimes closed, presumably by some kernel thresholds.

Code: Select all

[operator@server ~]$ date; netstat -nap | grep 5662 | grep "212.111.43.83"
Ut apr  2 07:26:59 UTC 2019
tcp        1      0 45.33.80.18:5662        212.111.43.83:41741     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:46720     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:39906     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:55612     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:60602     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:33125     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:40373     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:48312     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:46791     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:35472     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:38193     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:44717     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:55611     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:44055     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:37005     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:42794     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:41789     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:40794     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:44250     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:46255     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:52325     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:40884     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:47640     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:37520     CLOSE_WAIT  4879/nsca_uk 
...

Code: Select all

[operator@server ~]$ date; netstat -nap | grep 5662 | grep "212.111.43.83"
Ut apr  2 07:31:03 UTC 2019
tcp        1      0 45.33.80.18:5662        212.111.43.83:41741     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:46720     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:39906     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:55612     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:60602     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:33125     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:40373     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:48312     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:46791     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:35472     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:50978     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:38193     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:44717     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:55611     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:44055     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:37005     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:42794     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:48628     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:51626     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:41789     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:40794     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:44250     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:46255     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:49628     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:52325     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:40884     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:47640     CLOSE_WAIT  4879/nsca_uk        
tcp        1      0 45.33.80.18:5662        212.111.43.83:37520     CLOSE_WAIT  4879/nsca_uk
I think, though I am not sure, that the NSCA client v2.9 tries to establish a stream rather than a single connection per check: it attempts to recycle existing connections, but still creates new ones. A bug?

If you wish, I can run this test with the NSCA server v2.9 under xinetd, but that setup overloads the server, so I do not want to use it.

Thank you for your effort.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: NSCA 2.9 client problem

Post by cdienger »

The strace did show a lot of messages regarding not being able to allocate enough memory, and the syslog showed a lot of messages regarding having too many files open, so I think this has to do with system settings. What is the output of "ulimit -a"? Do you have anything configured in /etc/sysctl.conf or /etc/security/limits.conf? Check out https://support.nagios.com/kb/article/n ... ng-19.html for an example of increasing values via /etc/security/limits.conf.
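
For reference, the format used in /etc/security/limits.conf looks roughly like this (a sketch; the values are placeholders, and for a daemon started directly by systemd the LimitNOFILE= directive in its unit file is what actually applies):

Code: Select all

# /etc/security/limits.conf - hypothetical example, placeholder values
nagios  soft  nofile  65535
nagios  hard  nofile  65535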
lhozzan
Posts: 21
Joined: Wed Mar 20, 2019 10:43 am

Re: NSCA 2.9 client problem

Post by lhozzan »

Hello.

Yes, we have a custom unit file for systemd:

Code: Select all

[Unit]
#Description=Nagios Service Check Acceptor
Description=NSCA for uk cluster
After=syslog.target network.target auditd.service

[Service]
# default LimitNOFILE 1024 is too low
LimitNOFILE=4096
EnvironmentFile=-/etc/sysconfig/nsca
ExecStart=/usr/sbin/nsca_uk $OPTIONS -c /etc/nagios/nsca_uk.cfg
ExecReload=/bin/kill -HUP $MAINPID
Type=forking

# Auto-restart when it fails
Restart=on-failure
RestartSec=10
StartLimitInterval=35
StartLimitBurst=3

[Install]
WantedBy=multi-user.target
When the open files limit is reached, NSCA logs an error about failing to open a new file and stops accepting new checks. If you wish, I can raise this value, but I am not sure it will help.
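
To confirm whether the LimitNOFILE=4096 from the unit file is actually in effect, and how close the daemon gets to it, something like this may help (a sketch; it assumes a single nsca_uk process and sufficient privileges to read /proc):

Code: Select all

# effective open-files limit of the running daemon
grep 'open files' /proc/$(pidof nsca_uk)/limits
# number of file descriptors (including sockets) it currently holds open
ls /proc/$(pidof nsca_uk)/fd | wc -l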
What is the output of "ulimit -a"?

Code: Select all

[operator@server ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31824
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31824
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Do you have anything configured in /etc/sysctl.conf or /etc/security/limits.conf?
No, we are using the default values; those files contain only comments.
Check out https://support.nagios.com/kb/article/n ... ng-19.html for an example of increasing values via /etc/security/limits.conf.
OK, I will try putting these values in place and let you know.
lhozzan
Posts: 21
Joined: Wed Mar 20, 2019 10:43 am

Re: NSCA 2.9 client problem

Post by lhozzan »

Hello.

Unfortunately, this is not working either. The behaviour is the same.

Do you have any additional ideas on what to check?

Thank you for your effort.
Attachments
Normal - Reboot - Upgrade - Downgrade
Locked