`CRITICAL: NSCA Problem` core 4.3.4 + nsca 2.9.2-1

Locked
gonz86x
Posts: 2
Joined: Thu Nov 22, 2018 8:38 am

`CRITICAL: NSCA Problem` core 4.3.4 + nsca 2.9.2-1

Post by gonz86x »

Hello everyone,

I have been looking for help all around the web on this issue I am currently experiencing with the Passive Check implementation using NSCA.

We built our entire Nagios infrastructure around NSCA: our "Nagios Central" node runs the nsca daemon at all times and receives all the send_nsca check results from the distributed Nagios nodes (3 in each AWS region, plus 1 central Nagios node in total).
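
For readers unfamiliar with this kind of setup, here is a minimal sketch of how a distributed node might hand one passive result to the central nsca daemon. `send_nsca` reads tab-separated lines (host, service description, return code, plugin output) on stdin. The server name `nagios-central.example.com` and the `send_nsca.cfg` path are assumptions, not values from this post.

```shell
#!/bin/sh
# Build one passive service check result in the format send_nsca reads on stdin:
# host<TAB>service<TAB>return_code<TAB>plugin_output
format_passive_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Hypothetical defaults: the central server name and config path are assumptions.
NSCA_SERVER="${NSCA_SERVER:-nagios-central.example.com}"
SEND_NSCA_CFG="${SEND_NSCA_CFG:-/etc/nagios/send_nsca.cfg}"

# Pipe the formatted line to send_nsca on the default NSCA port.
submit_passive_result() {
    format_passive_result "$@" | send_nsca -H "$NSCA_SERVER" -p 5667 -c "$SEND_NSCA_CFG"
}

# Example call (not run here):
# submit_passive_result uw2-uat-devnode1 "Disk Usage" 0 "DISK OK"
```

Keeping the formatting separate from the network call makes it easy to check the exact line a node would send without touching the central server.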

Every morning I find all our checks in red CRITICAL status with the message "CRITICAL: NSCA Problem". The machine runs Amazon Linux 2 (RHEL 7 based). I found that restarting the nsca daemon, i.e. `systemctl restart nsca`, makes all the checks go back to normal... until it happens again.

Of course I could just add a cron job that restarts nsca every midnight or so, but that sounds ugly and is not what I want.

I am struggling to find any trace of errors from nsca when this happens. Perhaps someone here can help me unravel this mystery?

-------------------------------------
Example of the problem:

Code:

[1542889592] SERVICE ALERT: uw2-uat-devnode1;Disk Usage;CRITICAL;SOFT;1;CRITICAL: NSCA Problem
[1542889592] SERVICE ALERT: uw2-uat-devnode2;Disk Usage;CRITICAL;SOFT;1;CRITICAL: NSCA Problem
[1542889592] SERVICE ALERT: uw2-uat-prodnode1;Disk Usage;CRITICAL;SOFT;1;CRITICAL: NSCA Problem
[1542889592] SERVICE ALERT: uw2-uat-prodnode2;Disk Usage;CRITICAL;SOFT;1;CRITICAL: NSCA Problem
-------------------------------------

/etc/nagios/nsca.cfg =

Code:

log_facility=syslog
pid_file=/var/run/nagios/nsca.pid
server_port=5667
nsca_user=nagios
nsca_group=nagios
debug=0
command_file=/var/spool/nagios/cmd/nagios.cmd
alternate_dump_file=/var/spool/nagios/cmd/nsca.dump
aggregate_writes=0
append_to_file=0
max_packet_age=30
password=-----------------------
decryption_method=3

/etc/sysconfig/nsca =

Code:

# This file is used to set NSCA daemon options.
# 
# Options:
#   --inetd     = Run as a service under inetd or xinetd
#   --daemon    = Run as a standalone multi-process daemon
#   --single    = Run as a standalone single-process daemon (default)
OPTIONS=""


Is there any typical reason why the daemon goes flaky and makes all checks report false positives like this? Where could I find out? As you can see above, my nsca config is basic; I'm not doing much else.

PS: all of this gets applied automatically by a Puppet module.




Cheers,
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: `CRITICAL: NSCA Problem` core 4.3.4 + nsca 2.9.2-1

Post by scottwilkerson »

Clearly, if restarting the nsca service solves the issue, it is in fact a problem with the service.

What is causing the problem is going to be trickier to solve, but since it is set to log to syslog per your configuration, I would start there:

Code:

grep nsca /var/log/messages
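
On a systemd host such as Amazon Linux 2, the daemon's messages may also be in the journal. A small sketch, assuming the unit is called `nsca` (as the `systemctl restart nsca` in the first post suggests) and that syslog lands in `/var/log/messages`:

```shell
# Print nsca-related lines from a syslog-style file, case-insensitively.
# The default path is the one used in the grep above.
nsca_log_lines() {
    grep -i 'nsca' "${1:-/var/log/messages}"
}

# Same idea against the systemd journal (assumes the unit is named "nsca"):
# journalctl -u nsca --since "24 hours ago" --no-pager
```
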
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
gonz86x
Posts: 2
Joined: Thu Nov 22, 2018 8:38 am

Re: `CRITICAL: NSCA Problem` core 4.3.4 + nsca 2.9.2-1

Post by gonz86x »

Hey... yes, I am also 100% positive it is a problem with the service. The trouble is that every time it happens, I check the service (via systemctl) and it looks as if it is running fine, and the process shows up in ps as well.

Then in the syslog I can only see the aforementioned error and nothing more helpful.

In the meantime, I am having to resort to an ugly cron job that restarts nsca every 5 minutes... yuck! At least it has stopped the dashboard from going all "CRITICAL NSCA PROBLEM"!
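
Since the restart wipes out whatever state the daemon was in, one option is to have the cron job capture some evidence first. This is only a sketch: the output directory is hypothetical, the unit name `nsca` and port 5667 come from the config posted earlier, and `ss` and `journalctl` are assumed to be installed.

```shell
# Write a snapshot of nsca processes, sockets on port 5667, and recent journal
# entries to a timestamped file, then print the file's path.
collect_nsca_state() {
    out_dir="$1"
    mkdir -p "$out_dir" || return 1
    stamp=$(date +%Y%m%dT%H%M%S)
    out="$out_dir/nsca-state-$stamp.txt"
    {
        echo "== processes =="
        # [n]sca keeps the grep process itself out of the results
        ps -eo pid,ppid,stat,etime,args 2>&1 | grep '[n]sca' || true
        echo "== sockets on 5667 =="
        ss -tanp 2>&1 | grep ':5667' || true
        echo "== recent journal =="
        journalctl -u nsca --since "15 min ago" --no-pager 2>&1 || true
    } > "$out"
    echo "$out"
}

# Possible use from the cron job (the directory is an assumption):
# collect_nsca_state /var/tmp/nsca-debug && systemctl restart nsca
```

A pile of half-open connections or a stuck child process in such a snapshot would at least narrow down where to look next.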

If anyone has experienced this or has an idea what it could be or where to keep looking, I'm all ears. Maybe enabling debug mode would give more insight? I would be really thankful.

I'm also starting to wonder whether passive checks with nsca are really a reliable, stable thing. It's making me consider another tool like Prometheus for my use case (hundreds of EC2 instances coming up and down, and we run Puppet on the central node every 5 minutes to automatically add and remove host definitions for those).


Thanks all for your attention.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: `CRITICAL: NSCA Problem` core 4.3.4 + nsca 2.9.2-1

Post by scottwilkerson »

Is NSCA running as a stand-alone service or under xinetd?

Yes, turning on debug in nsca.cfg would be my next suggestion.

Code:

debug=1

Then restart the service. This will write the debug output to syslog.
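
A minimal sketch of turning debug on from the shell, assuming the config lives at `/etc/nagios/nsca.cfg` as posted earlier in the thread. Since the file is managed by Puppet, a manual edit like this would presumably be reverted on the next run, so the same change may need to go into the module.

```shell
# Set debug=1 in an nsca.cfg-style file, leaving every other line untouched.
# Writing back with cat keeps the file's existing owner and permissions.
enable_nsca_debug() {
    cfg="$1"
    tmp=$(mktemp) || return 1
    sed 's/^debug=.*/debug=1/' "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Possible use (paths and unit name taken from this thread):
# enable_nsca_debug /etc/nagios/nsca.cfg && systemctl restart nsca
```
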