NSCA 2.9 client problem

cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Post by cdienger »

Are you able to lab up another machine and reproduce this? I'd like to get another strace while a tcpdump is running as well, and run these both for a bit while the netstat output is monitored so we can focus on specific ports and connections to look for. Doing so on the current machine will likely result in files too large to analyze.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
lhozzan
Posts: 21
Joined: Wed Mar 20, 2019 10:43 am

Post by lhozzan »

Hello.

It is possible to create a Vagrant VM with an identical environment for the Nagios server, plus 2 additional VMs (one with NSCA client v2.7 and one with NSCA client v2.9). Then I can run the suggested command:

Code: Select all

tcpdump -s 0 -i any port 5667 -w output.pcap
on each VM.

Is this the procedure you mean?

Post by cdienger »

Yes, except I would be mainly interested in seeing just an environment with a 2.9 client for now. And I would also want to get another strace at the same time the new tcpdump was running:

Code: Select all

strace -pPID
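A minimal sketch of running both captures over the same time window, assuming root access on the lab VM. The function name `capture_nsca`, the 60-second default and the output file names are illustrative assumptions, not part of the original request; port 5667 is taken from the tcpdump command earlier in the thread.

```shell
# Hypothetical helper: capture traffic and syscalls for the same time window.
# Usage: capture_nsca <nsca-pid> [seconds]   (run as root)
capture_nsca() {
    pid=$1
    secs=${2:-60}
    # Packet capture on the NSCA port, stopped automatically after $secs.
    timeout "$secs" tcpdump -s 0 -i any port 5667 -w output.pcap &
    # Trace the daemon and any children it forks, with timestamps.
    timeout "$secs" strace -f -tt -p "$pid" -o strace.out
    # Wait for the background tcpdump to finish writing its file.
    wait
}
```

Bounding both tools with `timeout` keeps the files small enough to analyze, which was the concern with capturing on the busy production machine.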

Post by lhozzan »

Hello.

I added the requested files to my Google Drive; they have the prefix dev_.

The server IP is 172.17.1.23 and the client IPs are 172.17.1.161 (NSCA v2.7) and 172.17.1.11 (NSCA v2.9). The tcpdump captures are from the server and from the client with v2.9. All VMs run CentOS 7.

In this setup I don't see any connections stuck in CLOSE_WAIT. It seems this occurs only when many servers send reports. Alternatively, the connections may be closed by the kernel before we can spot them, because the total number of checks in this setup is too small. Hope that helps.
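To make the CLOSE_WAIT observation repeatable, the connection states on the NSCA port can be counted rather than eyeballed. The following is a sketch only: the helper name `count_states` is hypothetical, and it assumes the column layout of `ss -tan` (state in column 1, local address:port in column 4). Note that `ss` prints the state as `CLOSE-WAIT`, whereas `netstat` prints `CLOSE_WAIT`.

```shell
# Hypothetical helper: summarise TCP states for one local port.
# Reads `ss -tan` output on stdin; $1 is the port (5667 is used in this thread).
count_states() {
    awk -v port="${1:-5667}" '
        # Skip the header line, keep rows whose local address ends in :<port>.
        NR > 1 && $4 ~ (":" port "$") { n[$1]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}
```

Usage on the server would be `ss -tan | count_states 5667`, run periodically while the clients are sending.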

Is there anything else I can do for you?

Thank you for your effort.

Post by cdienger »

Unfortunately the data doesn't tell us much if the issue wasn't reproduced. The process is running as nagios, so let's switch to the nagios user and get the ulimit output:

Code: Select all

su - nagios
ulimit -a
I'd also like to get copies of /etc/security/limits.conf and /etc/security/limits.d/90-nproc.conf

Post by lhozzan »

Hello.
cdienger wrote: Unfortunately the data doesn't tell us much if the issue wasn't reproduced.
Yes, I know. This is the reason why I asked for support and sent statistics from the production environment. In the development environment I don't see any problems. It seems that the problem occurs when more than about 25 servers send notifications. I cannot simulate an environment similar to production :(.
cdienger wrote: The process is running as nagios, so let's switch to the nagios user and get the ulimit output
Unfortunately, we don't allow access to this account for security reasons, but I can obtain the limits of the currently running process.

Code: Select all

nagios:x:XXX:XXX:nagios,,,:/home/nagios:/sbin/nologin
Development environment:

Code: Select all

[root@status.global-devel ~]$ ps aux | grep nsca_glob-dev
nagios    2480  0.2  0.0  10912   892 ?        Ss   06:03   0:04 /usr/sbin/nsca_glob-dev -c /etc/nagios/nsca_glob-dev.cfg
root     17535  0.0  0.0  12492   696 pts/0    R+   06:33   0:00 grep --color=auto nsca_glob-dev
[root@status.global-devel ~]$ cat /proc/2480/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             3906                 3906                 processes 
Max open files            4096                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       3906                 3906                 signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us       
Production environment:

Code: Select all

[operator@server ~]$ ps aux | grep nsca_uk
root      3713  0.0  0.0 112740  2324 pts/0    S+   06:37   0:00 grep --color=auto nsca_uk
nagios    4198  1.6  0.0  11056  2260 ?        SLs  apr05 164:06 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg
[operator@server ~]$ cat /proc/4198/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31824                31824                processes 
Max open files            4096                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31824                31824                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
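Comparing the two limits tables by eye is error-prone, so a small filter can pull out a single row. This is a sketch under the assumption that the input has the `/proc/<pid>/limits` layout shown above; the helper name `proc_limit` is hypothetical.

```shell
# Hypothetical helper: print "soft hard" for one row of a /proc/<pid>/limits file.
# $1 = limit name exactly as printed (e.g. "Max open files"), file on stdin.
proc_limit() {
    awk -v name="$1" '
        index($0, name) == 1 {
            rest = substr($0, length(name) + 1)   # drop the name column
            split(rest, f, " ")                   # f[1]=soft, f[2]=hard
            print f[1], f[2]
        }
    '
}
```

For example, `proc_limit "Max open files" < /proc/4198/limits` would print the soft and hard values for the production daemon listed above.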
cdienger wrote: I'd also like to get copies of /etc/security/limits.conf and /etc/security/limits.d/90-nproc.conf
We don't modify limits.conf, so it is the default (identical in the development and production environments):

Code: Select all

[operator@server ~]$ cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#This file sets the resource limits for the users logged in via PAM.
#It does not affect resource limits of the system services.
#
#Also note that configuration files in /etc/security/limits.d directory,
#which are read in alphabetical order, override the settings in this
#file in case the domain is the same or more specific.
#That means for example that setting a limit for wildcard domain here
#can be overriden with a wildcard setting in a config file in the
#subdirectory, but a user specific setting here can be overriden only
#with a user specific setting in the subdirectory.
#
#Each line describes a limit for a user in the form:
#
#<domain>        <type>  <item>  <value>
#
#Where:
#<domain> can be:
#        - a user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#        - the wildcard %, can be also used with %group syntax,
#                 for maxlogin limit
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open file descriptors
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit (KB)
#        - maxlogins - max number of logins for this user
#        - maxsyslogins - max number of logins on the system
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#        - sigpending - max number of pending signals
#        - msgqueue - max memory used by POSIX message queues (bytes)
#        - nice - max nice priority allowed to raise to values: [-20, 19]
#        - rtprio - max realtime priority
#
#<domain>      <type>  <item>         <value>
#

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4

# End of file
The file 90-nproc.conf does not exist on our systems (same in the development and production environments):

Code: Select all

[operator@server ~]$ cat /etc/security/limits.d/90-nproc.conf
cat: /etc/security/limits.d/90-nproc.conf: No such file or directory
Is there anything else I can do to move this forward?

Thank you for your effort.

Post by cdienger »

I've been assuming that the nsca package has been 2.9.2 and not just 2.9, but please confirm it with:

Code: Select all

yum list | grep nsca
Also, add the following to /etc/security/limits.conf:

Code: Select all

#locked memory 
* hard memlock 256
* soft memlock 256

#open files 
* soft nofile 10000
* hard nofile 10000
root hard nofile 10000
root soft nofile 10000

#max user processes
* hard nproc 4096 
* soft nproc 4096

#stack size
* hard stack 20480
* soft stack 20480
reboot the server, and then check the ulimit values for root and the nagios user to make sure the memlock values have been updated.

Post by lhozzan »

Hello.
cdienger wrote: I've been assuming that the nsca package has been 2.9.2 and not just 2.9, but please confirm it with:
Yes, this is correct. We use the default available package version.

Code: Select all

[operator@server ~]$ yum list | grep nsca
nsca.x86_64                             2.9.2-1.el7                    @epel    
nsca-client.x86_64                      2.9.2-1.el7                    @epel    
cdienger wrote: Also, add the following to /etc/security/limits.conf:
...
reboot the server, and then check the ulimit values for root and the nagios user to make sure the memlock values have been updated.
OK, added. After the reboot the situation is as follows:

Code: Select all

[root@server ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31824
max locked memory       (kbytes, -l) 256
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 20480
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
For nagios user:

Code: Select all

[root@server ~]$ su - nagios -c ulimit' -a'
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31824
max locked memory       (kbytes, -l) 256
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 20480
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
For the specific processes:

Code: Select all

[root@server ~]$ ps aux | grep nsca_uk
nagios    2451  1.6  0.0  11056  2236 ?        SLs  07:07   0:20 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg
root     22180  0.0  0.0 112744  2336 pts/0    S+   07:28   0:00 grep --color=auto nsca_uk
[root@server ~]$ cat /proc/2451/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31824                31824                processes 
Max open files            4096                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31824                31824                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
I also spotted a possible SYN flooding warning for another NSCA process:

Code: Select all

[root@server ~]$ grep "SYN" /var/log/messages
Apr 15 07:07:40 localhost kernel: TCP: request_sock_TCP: Possible SYN flooding on port 5660. Sending cookies.  Check SNMP counters.
Its limits are the same as well:

Code: Select all

[root@status.linode-us-nj ~]$ ps aux | grep nsca_de
nagios    2454  4.4  0.0  11980  3080 ?        SLs  07:07   0:57 /usr/sbin/nsca_de -c /etc/nagios/nsca_de.cfg
root     22818  0.0  0.0 112744  2320 pts/0    S+   07:29   0:00 grep --color=auto nsca_de
[root@server ~]$ cat /proc/2454/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31824                31824                processes 
Max open files            4096                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31824                31824                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us   
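One possible reading of the output above, offered as an assumption rather than a diagnosis: the shell `ulimit` values changed after the reboot, but `/proc/<pid>/limits` for the daemons still shows 4096 open files and 65536 bytes of locked memory. The header of limits.conf quoted earlier states that it "does not affect resource limits of the system services", and the syslog output later in this thread shows the daemon running as the systemd unit nsca_uk.service. If that is the cause, the limits would need to be set on the unit itself. A minimal sketch follows; the helper name `write_limits_dropin`, the drop-in file name and the values are illustrative, with `LimitMEMLOCK` given in bytes (256 KB).

```shell
# Hypothetical helper: write a systemd drop-in that raises limits for one unit.
# $1 = unit name (assumed: nsca_uk.service), $2 = base dir (default /etc/systemd/system).
write_limits_dropin() {
    unit=$1
    base=${2:-/etc/systemd/system}
    mkdir -p "$base/$unit.d" || return 1
    # LimitNOFILE / LimitMEMLOCK are systemd.exec directives; values are examples only.
    cat > "$base/$unit.d/limits.conf" <<'EOF'
[Service]
LimitNOFILE=10000
LimitMEMLOCK=262144
EOF
}
```

After `write_limits_dropin nsca_uk.service`, running `systemctl daemon-reload` and `systemctl restart nsca_uk.service` and then re-reading `/proc/<pid>/limits` would show whether the new values reached the process.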
I have now let it run longer, but unfortunately this didn't help :(. See my screenshot.

Is there anything else I can do to move this forward?

Thank you for your effort.
Attachments
normal, reboot, upgrade
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Post by tgriep »

Can you post an example of how a typical client is set up?
What commands and scripts is it running, and how are they run?

On the NSCA server (Nagios) edit the nsca.cfg file and enable debugging by changing this from

Code: Select all

debug=0
to

Code: Select all

debug=1
Save and restart the NSCA server.

Then if a connection hangs on the server, look in the /var/log/messages file to see if you can find out why.

Post by lhozzan »

Hello.
tgriep wrote: Can you post an example of how a typical client is set up?
What commands and scripts is it running, and how are they run?
We use collectd to collect all the metrics we are interested in. In collectd we use the threshold plugin to evaluate the metrics. Whether a watched metric is OK or wrong, collectd sends its status to Nagios via NSCA at a 10-second interval. Each server watches a different number of metrics, such as CPU load, disk space, network status, ...
Collectd calls a Perl NSCA plugin that converts its output format to the NSCA input format and sends it to the Nagios server.

collectd-tresholds.conf

Code: Select all

LoadPlugin "threshold"

<Plugin "threshold">

    <Plugin "processes">
        Instance "all"
        <Type "ps_count">
            DataSource "processes"
            FailureMin    1
            Invert false
            Persist true
            PersistOK true
        </Type>
    </Plugin>

    <Plugin "load">
        <Type "load">
            DataSource "longterm"
            WarningMin 5
            FailureMin 7
            Invert true
            Persist true
            PersistOK true
        </Type>
    </Plugin>  

    <Plugin "df">
        Instance "root"
        <Type "percent_bytes">
            Instance "used"
            DataSource "value"
            WarningMin 75
            FailureMin 90
            Invert true
            Persist true
            PersistOK true
        </Type>
    </Plugin>  
</Plugin>  
nagios_nsca.conf

Code: Select all

<LoadPlugin perl>
  Globals true
</LoadPlugin>

<Plugin perl>
  IncludeDir "/srv/utils/perl"
  BaseName "Collectd::Plugins"
  LoadPlugin nagios_passive
  <Plugin nagios_passive>
    debug 0
    debugDump 0
  </Plugin>
</Plugin>
If you wish, I can share this Perl plugin, but I think the interesting line is:

Code: Select all

system("echo '$passiv' | send_nsca -H NSCA-IP -p NSCA-PORT -c SENT-NSCA-CONFIG");
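For reference, send_nsca's default input for a service check is one tab-separated line per result: host name, service description, return code, plugin output. The sketch below builds such a line with `printf` instead of `echo`, which sidesteps quoting problems if the message itself contains a single quote. The helper name `nsca_line` and the sample host and service values are hypothetical; the content of `$passiv` in the actual plugin is not shown in the thread.

```shell
# Hypothetical helper: build one service check result in send_nsca's default
# input format: host<TAB>service<TAB>return_code<TAB>plugin_output<NEWLINE>.
nsca_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}
```

Usage would look like `nsca_line web01 "CPU load" 0 "OK - load 0.42" | send_nsca -H NSCA-IP -p NSCA-PORT -c SENT-NSCA-CONFIG`. Since send_nsca reads all lines from standard input, piping several results into one invocation should reduce the number of connections each client opens every 10 seconds, though whether that helps here is untested.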
tgriep wrote: On the NSCA server (Nagios) edit the nsca.cfg file and enable debugging by changing this from
OK, I turned debug mode on.
When I filter out all the default messages with the command

Code: Select all

grep -v -e "End of connection" -e "Handling the connection" -e "Attempting to write" -e "Time difference" -e "SERVICE CHECK" -e "HOST CHECK" /var/log/messages
I see that many messages were suppressed :(.

Code: Select all

Apr 16 10:00:31 localhost nsca[1772]: Starting up daemon
Apr 16 10:00:31 localhost systemd: Started NSCA for uk cluster.
Apr 16 10:00:35 localhost nagios: job 3273 (pid=1820): read() returned error 11
Apr 16 10:01:01 localhost systemd: Started Session 193 of user root.
Apr 16 10:01:01 localhost journal: Suppressed 5344 messages from /system.slice/nsca_uk.service
Apr 16 10:01:01 localhost journal: Suppressed 1677 messages from /system.slice/nsca_uk.service
Apr 16 10:01:01 localhost journal: Suppressed 9067 messages from /system.slice/nsca_uk.service
Apr 16 10:01:27 localhost nagios: job 3274 (pid=2293): read() returned error 11
Apr 16 10:01:31 localhost journal: Suppressed 5416 messages from /system.slice/nsca_uk.service
Apr 16 10:01:31 localhost journal: Suppressed 1710 messages from /system.slice/nsca_uk.service
Apr 16 10:01:31 localhost journal: Suppressed 9120 messages from /system.slice/nsca_uk.service
Apr 16 10:01:51 localhost nagios: job 3278 (pid=2572): read() returned error 11
Apr 16 10:02:01 localhost journal: Suppressed 8685 messages from /system.slice/nsca_uk.service
Apr 16 10:02:01 localhost journal: Suppressed 5178 messages from /system.slice/nsca_uk.service
Apr 16 10:02:01 localhost journal: Suppressed 1852 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost journal: Suppressed 9009 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost journal: Suppressed 5529 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost journal: Suppressed 1943 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost rsyslogd: imjournal: 14826 messages lost due to rate-limiting
Apr 16 10:04:02 localhost journal: Suppressed 9349 messages from /system.slice/nsca_uk.service
Apr 16 10:04:02 localhost journal: Suppressed 5659 messages from /system.slice/nsca_uk.service
Apr 16 10:04:02 localhost journal: Suppressed 1963 messages from /system.slice/nsca_uk.service
Apr 16 10:04:11 localhost nagios: job 3280 (pid=3842): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3280 (pid=3841): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3852): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3851): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3868): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3873): read() returned error 11
Apr 16 10:04:33 localhost journal: Suppressed 9770 messages from /system.slice/nsca_uk.service
Apr 16 10:04:33 localhost journal: Suppressed 5929 messages from /system.slice/nsca_uk.service
Apr 16 10:04:33 localhost journal: Suppressed 2093 messages from /system.slice/nsca_uk.service
Apr 16 10:05:03 localhost journal: Suppressed 9688 messages from /system.slice/nsca_uk.service
Apr 16 10:05:03 localhost journal: Suppressed 5792 messages from /system.slice/nsca_uk.service
Apr 16 10:05:03 localhost journal: Suppressed 1986 messages from /system.slice/nsca_uk.service
Apr 16 10:13:34 localhost rsyslogd: imjournal: 44375 messages lost due to rate-limiting
Apr 16 10:14:25 localhost nagios: job 3301 (pid=9321): read() returned error 11
Apr 16 10:14:25 localhost nagios: job 3302 (pid=9327): read() returned error 11
It seems that everything works fine, but NSCA is not able to handle that many connections correctly, especially when a v2.9 client is installed somewhere.
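The "Suppressed N messages" lines in the log above mean that journald dropped most of the debug output, so the filtered view is incomplete. A rough way to gauge how much was lost is to total those counters per unit. This is a sketch only; the helper name `suppressed_totals` is hypothetical and it assumes the exact message wording shown in the log. The threshold itself is governed by journald's `RateLimitInterval` and `RateLimitBurst` settings in journald.conf, and rsyslog's imjournal input applies its own rate limit on top.

```shell
# Hypothetical helper: total the messages journald reports as suppressed, per unit.
# Reads syslog lines on stdin.
suppressed_totals() {
    awk '
        /Suppressed [0-9]+ messages from/ {
            for (i = 1; i <= NF; i++)
                # Field after "Suppressed" is the count; four fields on is the unit.
                if ($i == "Suppressed") { total[$(i + 4)] += $(i + 1) }
        }
        END { for (u in total) print u, total[u] }
    ' | sort
}
```

Usage would be `suppressed_totals < /var/log/messages`.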

Do you have any idea what to check next to find a working solution?

Thank you for your effort.