NSCA 2.9 client problem
Re: NSCA 2.9 client problem
Are you able to lab up another machine and reproduce this? I'd like to get another strace while a tcpdump is running as well, and run these both for a bit while the netstat output is monitored so we can focus on specific ports and connections to look for. Doing so on the current machine will likely result in files too large to analyze.
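For the netstat-monitoring part, a small sketch like the following makes CLOSE_WAIT build-up on the NSCA port easy to spot. The port 5667 and the sample lines are stand-ins for live `netstat -ant` output:

```shell
# Count connections per TCP state on the NSCA port (5667 assumed).
# Sample lines stand in for live `netstat -ant` output.
sample='tcp        0      0 172.17.1.23:5667   172.17.1.11:40001    ESTABLISHED
tcp        1      0 172.17.1.23:5667   172.17.1.11:40002    CLOSE_WAIT
tcp        1      0 172.17.1.23:5667   172.17.1.161:40003   CLOSE_WAIT'
# In `netstat -ant` output, field 4 is the local address and field 6 the state.
echo "$sample" | awk '$4 ~ /:5667$/ { n[$6]++ } END { for (s in n) print s, n[s] }'
```

Against the live system the same awk filter would be fed from `netstat -ant` every few seconds while the tcpdump and strace captures are running.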
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: NSCA 2.9 client problem
Hello.
It is possible to create a Vagrant VM with an identical environment for the Nagios server, plus 2 additional VMs (one with NSCA client v2.7 and one with NSCA client v2.9). Then I can run the suggested command on each VM:
Code: Select all
tcpdump -s 0 -i any port 5667 -w output.pcap
Do you mean this procedure?
Re: NSCA 2.9 client problem
Yes, except I would mainly be interested in seeing just an environment with a 2.9 client for now. I would also want to get another strace at the same time the new tcpdump is running:
Code: Select all
strace -p <PID>
Re: NSCA 2.9 client problem
Hello.
I've added the requested files to my Google Drive; they have the prefix dev_.
The server IP is 172.17.1.23 and the client IPs are 172.17.1.161 (NSCA v2.7) and 172.17.1.11 (NSCA v2.9). The tcpdump captures are from the server and from the v2.9 client. Each VM is CentOS 7.
In this setup I don't spot any open CLOSE_WAIT connections. It seems this occurs only when many servers send reports. Or the connections are closed by the kernel before we can spot them, because the total number of checks in this setup is too small. Hope that helps.
Is there anything else I can do for you?
Thank you for your effort.
Re: NSCA 2.9 client problem
Unfortunately the data doesn't tell us much if the issue wasn't reproduced. The process is running as nagios, so let's switch to the nagios user and get the ulimit output:
I'd also like to get copies of /etc/security/limits.conf and /etc/security/limits.d/90-nproc.conf
Code: Select all
su - nagios
ulimit -a
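If switching to the nagios user is not possible (for example because it has a nologin shell), the same information can be read from the kernel's per-process view. A sketch, using the current shell's PID `$$` as a stand-in for the nsca daemon's PID:

```shell
# Effective limits of a running process, without su:
# /proc/<PID>/limits shows the soft and hard values the process actually has.
grep -E 'Max (open files|processes|locked memory)' "/proc/$$/limits"
```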
Re: NSCA 2.9 client problem
Hello.
cdienger wrote: Unfortunately the data doesn't tell us much if the issue wasn't reproduced.
Yes, I know. That is why I asked for support and sent statistics from the production environment. In the development environment I don't spot any problems. It seems this problem occurs when more than ~25 servers send notifications, and an environment like that, similar to production, I cannot simulate.
cdienger wrote: The process is running as nagios, so let's switch to the nagios user and get the ulimit output.
Unfortunately, we don't allow logins to this account for security reasons, but I can obtain the limits of the currently running process:
Code: Select all
nagios:x:XXX:XXX:nagios,,,:/home/nagios:/sbin/nologin
Developer environment:
Code: Select all
[root@status.global-devel ~]$ ps aux | grep nsca_glob-dev
nagios 2480 0.2 0.0 10912 892 ? Ss 06:03 0:04 /usr/sbin/nsca_glob-dev -c /etc/nagios/nsca_glob-dev.cfg
root 17535 0.0 0.0 12492 696 pts/0 R+ 06:33 0:00 grep --color=auto nsca_glob-dev
[root@status.global-devel ~]$ cat /proc/2480/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 3906 3906 processes
Max open files 4096 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 3906 3906 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Production environment:
Code: Select all
[operator@server ~]$ ps aux | grep nsca_uk
root 3713 0.0 0.0 112740 2324 pts/0 S+ 06:37 0:00 grep --color=auto nsca_uk
nagios 4198 1.6 0.0 11056 2260 ? SLs apr05 164:06 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg
[operator@server ~]$ cat /proc/4198/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 31824 31824 processes
Max open files 4096 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 31824 31824 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
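One caveat worth noting here: on CentOS 7 the nsca daemons are started by systemd, and limits from /etc/security/limits.conf (applied by PAM at login) generally do not affect systemd services. If the goal is to raise the daemon's own limits, a unit drop-in is the usual route; a sketch, assuming the unit name nsca_uk.service seen in the prompts above:

```ini
# /etc/systemd/system/nsca_uk.service.d/limits.conf (hypothetical drop-in path)
[Service]
LimitNOFILE=10000
LimitMEMLOCK=262144
```

followed by `systemctl daemon-reload` and a restart of the unit.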
cdienger wrote: I'd also like to get copies of /etc/security/limits.conf and /etc/security/limits.d/90-nproc.conf.
We don't modify limits.conf, so we are using the default (identical in the developer and production environments):
Code: Select all
[operator@server ~]$ cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#This file sets the resource limits for the users logged in via PAM.
#It does not affect resource limits of the system services.
#
#Also note that configuration files in /etc/security/limits.d directory,
#which are read in alphabetical order, override the settings in this
#file in case the domain is the same or more specific.
#That means for example that setting a limit for wildcard domain here
#can be overriden with a wildcard setting in a config file in the
#subdirectory, but a user specific setting here can be overriden only
#with a user specific setting in the subdirectory.
#
#Each line describes a limit for a user in the form:
#
#<domain> <type> <item> <value>
#
#Where:
#<domain> can be:
# - a user name
# - a group name, with @group syntax
# - the wildcard *, for default entry
# - the wildcard %, can be also used with %group syntax,
# for maxlogin limit
#
#<type> can have the two values:
# - "soft" for enforcing the soft limits
# - "hard" for enforcing hard limits
#
#<item> can be one of the following:
# - core - limits the core file size (KB)
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open file descriptors
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - nproc - max number of processes
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
#
#<domain> <type> <item> <value>
#
#* soft core 0
#* hard rss 10000
#@student hard nproc 20
#@faculty soft nproc 20
#@faculty hard nproc 50
#ftp hard nproc 0
#@student - maxlogins 4
# End of file
The file 90-nproc.conf does not exist for us (identical situation in the developer and production environments):
Code: Select all
[operator@server ~]$ cat /etc/security/limits.d/90-nproc.conf
cat: /etc/security/limits.d/90-nproc.conf: No such file or directory
Is there anything else I can do to move this forward?
Thank you for your effort.
Re: NSCA 2.9 client problem
I've been assuming that the nsca package is 2.9.2 and not just 2.9, but please confirm it with:
Code: Select all
yum list | grep nsca
Also, add the following to /etc/security/limits.conf:
Code: Select all
#locked memory
* hard memlock 256
* soft memlock 256
#open files
* soft nofile 10000
* hard nofile 10000
root hard nofile 10000
root soft nofile 10000
#max user processes
* hard nproc 4096
* soft nproc 4096
#stack size
* hard stack 20480
* soft stack 20480
Then reboot the server and check the ulimit values for root and the nagios user to make sure the memlock values have been updated.
Re: NSCA 2.9 client problem
Hello.
cdienger wrote: I've been assuming that the nsca package has been 2.9.2 and not just 2.9, but please confirm it with:
Yes, this is correct. We use the default available package version:
Code: Select all
[operator@server ~]$ yum list | grep nsca
nsca.x86_64 2.9.2-1.el7 @epel
nsca-client.x86_64 2.9.2-1.el7 @epel
cdienger wrote: Also, add the following to /etc/security/limits.conf: ... reboot the server, and then check the ulimit values for root and the nagios user to make sure the memlock values have been updated.
OK, I placed the settings; after the reboot the situation is as follows.
For the root user:
Code: Select all
[root@server ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31824
max locked memory (kbytes, -l) 256
max memory size (kbytes, -m) unlimited
open files (-n) 10000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 20480
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
For the nagios user:
Code: Select all
[root@server ~]$ su - nagios -c 'ulimit -a'
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31824
max locked memory (kbytes, -l) 256
max memory size (kbytes, -m) unlimited
open files (-n) 10000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 20480
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
For specific processes:
Code: Select all
[root@server ~]$ ps aux | grep nsca_uk
nagios 2451 1.6 0.0 11056 2236 ? SLs 07:07 0:20 /usr/sbin/nsca_uk -c /etc/nagios/nsca_uk.cfg
root 22180 0.0 0.0 112744 2336 pts/0 S+ 07:28 0:00 grep --color=auto nsca_uk
[root@server ~]$ cat /proc/2451/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 31824 31824 processes
Max open files 4096 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 31824 31824 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
I also spot a problem with a "SYN attack" (SYN flood) warning for another NSCA process:
Code: Select all
[root@server ~]$ grep "SYN" /var/log/messages
Apr 15 07:07:40 localhost kernel: TCP: request_sock_TCP: Possible SYN flooding on port 5660. Sending cookies. Check SNMP counters.
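The "Possible SYN flooding ... Sending cookies" message means the listen queue for that port overflowed and the kernel fell back to SYN cookies. The knobs involved can be inspected like this (a read-only sketch; whether to raise them is a separate judgment call):

```shell
# 1 = kernel may send SYN cookies when the SYN backlog overflows
cat /proc/sys/net/ipv4/tcp_syncookies
# Size of the half-open (SYN_RECV) queue per listener
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
# Upper bound on any socket's accept() backlog
cat /proc/sys/net/core/somaxconn
```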
The limits are also identical:
Code: Select all
[root@status.linode-us-nj ~]$ ps aux | grep nsca_de
nagios 2454 4.4 0.0 11980 3080 ? SLs 07:07 0:57 /usr/sbin/nsca_de -c /etc/nagios/nsca_de.cfg
root 22818 0.0 0.0 112744 2320 pts/0 S+ 07:29 0:00 grep --color=auto nsca_de
[root@server ~]$ cat /proc/2454/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 31824 31824 processes
Max open files 4096 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 31824 31824 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
I also let it run for longer but, unfortunately, this didn't help; see my screenshot. Is there anything else I can do to move this forward?
Thank you for your effort.
Re: NSCA 2.9 client problem
Can you post an example of how a typical client is set up?
What commands and scripts is it running, and how is it run?
On the NSCA server (Nagios), edit the nsca.cfg file and enable debugging by changing this from
Code: Select all
debug=0
to
Code: Select all
debug=1
Save and restart the NSCA server.
Then, if a connection hangs on the server, look in the /var/log/messages file to see if you can find out why.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: NSCA 2.9 client problem
Hello.
tgriep wrote: Can you post an example of how a typical client is set up? What commands and scripts is it running, and how is it run?
We use collectd to gather all the metrics we are interested in, and its threshold plugin to evaluate them. Depending on whether a watched metric is OK or wrong, collectd sends the status to Nagios via NSCA at a 10-second interval. In fact, we watch a different number of metrics on each server: CPU load, disk space, network status, ...
Collectd calls a Perl NSCA plugin that converts its output format to the NSCA input format and sends it to the Nagios server.
collectd-tresholds.conf:
Code: Select all
LoadPlugin "threshold"
<Plugin "threshold">
<Plugin "processes">
Instance "all"
<Type "ps_count">
DataSource "processes"
FailureMin 1
Invert false
Persist true
PersistOK true
</Type>
</Plugin>
<Plugin "load">
<Type "load">
DataSource "longterm"
WarningMin 5
FailureMin 7
Invert true
Persist true
PersistOK true
</Type>
</Plugin>
<Plugin "df">
Instance "root"
<Type "percent_bytes">
Instance "used"
DataSource "value"
WarningMin 75
FailureMin 90
Invert true
Persist true
PersistOK true
</Type>
</Plugin>
</Plugin>
nagios_nsca.conf:
Code: Select all
<LoadPlugin perl>
Globals true
</LoadPlugin>
<Plugin perl>
IncludeDir "/srv/utils/perl"
BaseName "Collectd::Plugins"
LoadPlugin nagios_passive
<Plugin nagios_passive>
debug 0
debugDump 0
</Plugin>
</Plugin>
If you wish, I can share this Perl plugin, but I think the interesting line is:
Code: Select all
system("echo '$passiv' | send_nsca -H NSCA-IP -p NSCA-PORT -c SENT-NSCA-CONFIG");
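One thing stands out in this line: it opens a new shell and a new send_nsca connection per result. send_nsca reads tab-separated result lines from stdin, one per line, so several results can be batched through a single connection, which would cut the connection count considerably. A sketch of building such a payload (the host and service names are made up):

```shell
# Passive service-check format: <host>TAB<service>TAB<return code>TAB<plugin output>
payload=$(printf '%s\t%s\t%s\t%s\n%s\t%s\t%s\t%s\n' \
    web01 load 1 'WARNING - load 5.2' \
    web01 disk_root 0 'OK - 40% used')
printf '%s\n' "$payload"
# Live use: printf '%s\n' "$payload" | send_nsca -H NSCA-IP -p NSCA-PORT -c SENT-NSCA-CONFIG
```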
tgriep wrote: On the NSCA server (Nagios) edit the nsca.cfg file and enable debugging.
So I turned debug mode on. When I filter all the routine messages with this command:
Code: Select all
grep -v -e "End of connection" -e "Handling the connection" -e "Attempting to write" -e "Time difference" -e "SERVICE CHECK" -e "HOST CHECK" /var/log/messages
I see many suppressed messages:
Code: Select all
Apr 16 10:00:31 localhost nsca[1772]: Starting up daemon
Apr 16 10:00:31 localhost systemd: Started NSCA for uk cluster.
Apr 16 10:00:35 localhost nagios: job 3273 (pid=1820): read() returned error 11
Apr 16 10:01:01 localhost systemd: Started Session 193 of user root.
Apr 16 10:01:01 localhost journal: Suppressed 5344 messages from /system.slice/nsca_uk.service
Apr 16 10:01:01 localhost journal: Suppressed 1677 messages from /system.slice/nsca_uk.service
Apr 16 10:01:01 localhost journal: Suppressed 9067 messages from /system.slice/nsca_uk.service
Apr 16 10:01:27 localhost nagios: job 3274 (pid=2293): read() returned error 11
Apr 16 10:01:31 localhost journal: Suppressed 5416 messages from /system.slice/nsca_uk.service
Apr 16 10:01:31 localhost journal: Suppressed 1710 messages from /system.slice/nsca_uk.service
Apr 16 10:01:31 localhost journal: Suppressed 9120 messages from /system.slice/nsca_uk.service
Apr 16 10:01:51 localhost nagios: job 3278 (pid=2572): read() returned error 11
Apr 16 10:02:01 localhost journal: Suppressed 8685 messages from /system.slice/nsca_uk.service
Apr 16 10:02:01 localhost journal: Suppressed 5178 messages from /system.slice/nsca_uk.service
Apr 16 10:02:01 localhost journal: Suppressed 1852 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost journal: Suppressed 9009 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost journal: Suppressed 5529 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost journal: Suppressed 1943 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost rsyslogd: imjournal: 14826 messages lost due to rate-limiting
Apr 16 10:04:02 localhost journal: Suppressed 9349 messages from /system.slice/nsca_uk.service
Apr 16 10:04:02 localhost journal: Suppressed 5659 messages from /system.slice/nsca_uk.service
Apr 16 10:04:02 localhost journal: Suppressed 1963 messages from /system.slice/nsca_uk.service
Apr 16 10:04:11 localhost nagios: job 3280 (pid=3842): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3280 (pid=3841): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3852): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3851): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3868): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3873): read() returned error 11
Apr 16 10:04:33 localhost journal: Suppressed 9770 messages from /system.slice/nsca_uk.service
Apr 16 10:04:33 localhost journal: Suppressed 5929 messages from /system.slice/nsca_uk.service
Apr 16 10:04:33 localhost journal: Suppressed 2093 messages from /system.slice/nsca_uk.service
Apr 16 10:05:03 localhost journal: Suppressed 9688 messages from /system.slice/nsca_uk.service
Apr 16 10:05:03 localhost journal: Suppressed 5792 messages from /system.slice/nsca_uk.service
Apr 16 10:05:03 localhost journal: Suppressed 1986 messages from /system.slice/nsca_uk.service
Apr 16 10:13:34 localhost rsyslogd: imjournal: 44375 messages lost due to rate-limiting
Apr 16 10:14:25 localhost nagios: job 3301 (pid=9321): read() returned error 11
Apr 16 10:14:25 localhost nagios: job 3302 (pid=9327): read() returned error 11
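The "Suppressed N messages" and "messages lost due to rate-limiting" lines mean journald and rsyslog are dropping most of the debug output, so the interesting error may never reach /var/log/messages at all. While debugging, the journald rate limit can be relaxed; a sketch (on CentOS 7 the keys are RateLimitInterval/RateLimitBurst; newer systemd spells the first one RateLimitIntervalSec):

```ini
# /etc/systemd/journald.conf (restart systemd-journald afterwards)
[Journal]
RateLimitInterval=30s
RateLimitBurst=50000
```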
It seems everything is working fine, but NSCA is not able to handle that many connections correctly, especially when a v2.9 client is installed somewhere. Do you have any idea what to check next to reach a working solution?
Thank you for your effort.