I have been looking all over the web for help with an issue I am experiencing with our passive check implementation using NSCA.
We built our entire Nagios infrastructure around NSCA: our "Nagios Central" node runs the nsca daemon at all times and receives all the send_nsca check results from the distributed Nagios nodes (3 in each AWS region, plus 1 Nagios Central in total).
Every morning I find all our checks in red CRITICAL status with the message "CRITICAL: NSCA Problem". The machine it runs on is Amazon Linux 2 (RHEL 7 based), and I found that restarting the nsca daemon, i.e. `systemctl restart nsca`, makes all the checks go back to normal... until it happens again.
Of course I could just add a cron job that restarts nsca every midnight or so, but that is an ugly workaround and not what I want.
I am struggling to find any trace of errors that nsca reports when this happens. Perhaps someone here can help me unravel this mystery?
-------------------------------------
Example of the problem:
[1542889592] SERVICE ALERT: uw2-uat-devnode1;Disk Usage;CRITICAL;SOFT;1;CRITICAL: NSCA Problem
[1542889592] SERVICE ALERT: uw2-uat-devnode2;Disk Usage;CRITICAL;SOFT;1;CRITICAL: NSCA Problem
[1542889592] SERVICE ALERT: uw2-uat-prodnode1;Disk Usage;CRITICAL;SOFT;1;CRITICAL: NSCA Problem
[1542889592] SERVICE ALERT: uw2-uat-prodnode2;Disk Usage;CRITICAL;SOFT;1;CRITICAL: NSCA Problem
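To pin down when the checks flip, it helps to turn the epoch timestamps in these lines into readable times. The following is a minimal sketch, assuming the log lines always follow the `[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output` layout seen above; the names `ALERT_RE` and `parse_alert` are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Layout assumed from the lines above:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"^\[(?P<epoch>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_alert(line):
    """Return a dict for one SERVICE ALERT line, or None if it does not match."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    alert = match.groupdict()
    # Convert the epoch seconds to a readable UTC timestamp.
    alert["time_utc"] = datetime.fromtimestamp(
        int(alert["epoch"]), tz=timezone.utc
    ).strftime("%Y-%m-%d %H:%M:%S")
    return alert


if __name__ == "__main__":
    sample = (
        "[1542889592] SERVICE ALERT: uw2-uat-devnode1;Disk Usage;"
        "CRITICAL;SOFT;1;CRITICAL: NSCA Problem"
    )
    alert = parse_alert(sample)
    print(alert["time_utc"], alert["host"], alert["service"], alert["state"])
```

All four lines above carry the same timestamp, so the hosts went CRITICAL in the same second rather than one by one.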
/etc/nagios/nsca.cfg =
log_facility=syslog
pid_file=/var/run/nagios/nsca.pid
server_port=5667
nsca_user=nagios
nsca_group=nagios
debug=0
command_file=/var/spool/nagios/cmd/nagios.cmd
alternate_dump_file=/var/spool/nagios/cmd/nsca.dump
aggregate_writes=0
append_to_file=0
max_packet_age=30
password=-----------------------
decryption_method=3
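Since the file is managed by Puppet, it can be worth confirming what is actually on disk. Below is a minimal sketch that parses the `key=value` lines and flags the one setting relevant to my question about missing traces. It assumes, going by the comments in the stock sample nsca.cfg, that `debug=1` makes the daemon log debug messages to syslog; `parse_nsca_cfg` and `diagnostics_hints` are hypothetical helper names, not part of NSCA.

```python
def parse_nsca_cfg(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            settings[key.strip()] = value.strip()
    return settings


def diagnostics_hints(settings):
    """Return human-readable hints about settings that hide diagnostics."""
    hints = []
    # Assumption: debug=0 means no debug messages are logged.
    if settings.get("debug", "0") == "0":
        hints.append("debug=0: debug messages are off; try debug=1 and watch syslog")
    return hints


if __name__ == "__main__":
    # Path taken from the post; adjust if the file lives elsewhere.
    with open("/etc/nagios/nsca.cfg") as handle:
        cfg = parse_nsca_cfg(handle.read())
    for hint in diagnostics_hints(cfg):
        print(hint)
```

With the config shown above this would print the `debug=0` hint, which may explain why I see no messages from nsca when the checks go CRITICAL.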
/etc/sysconfig/nsca =
# This file is used to set NSCA daemon options.
#
# Options:
# --inetd = Run as a service under inetd or xinetd
# --daemon = Run as a standalone multi-process daemon
# --single = Run as a standalone single-process daemon (default)
OPTIONS=""
Is there a typical reason why the daemon becomes flaky and makes all checks report false positives like this? Where could I find out? As you can see above, my nsca config is basic; I'm not doing much else.
PS: all of this is applied automatically by a Puppet module.
Cheers,