Hello.
tgriep wrote:Can you post an example of how a typical client is setup?
What commands and scripts is it running and how it is ran?
We use collectd to collect all the metrics we are interested in, and its threshold plugin to evaluate them. Whether a watched metric is OK or wrong, collectd sends its status to Nagios via NSCA every 10 seconds. Each server watches a different number of metrics, such as CPU load, disk space, network status, ...
collectd calls a Perl NSCA plugin that converts collectd's output format into the NSCA input format and sends the result to the Nagios server.
collectd-tresholds.conf
Code:
LoadPlugin "threshold"
<Plugin "threshold">
  <Plugin "processes">
    Instance "all"
    <Type "ps_count">
      DataSource "processes"
      FailureMin 1
      Invert false
      Persist true
      PersistOK true
    </Type>
  </Plugin>
  <Plugin "load">
    <Type "load">
      DataSource "longterm"
      WarningMin 5
      FailureMin 7
      Invert true
      Persist true
      PersistOK true
    </Type>
  </Plugin>
  <Plugin "df">
    Instance "root"
    <Type "percent_bytes">
      Instance "used"
      DataSource "value"
      WarningMin 75
      FailureMin 90
      Invert true
      Persist true
      PersistOK true
    </Type>
  </Plugin>
</Plugin>
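To make the thresholds above concrete, here is a minimal shell sketch of how the "df" block would classify a value. It assumes that "Invert true" with only Min bounds set means "values at or above the bound are not okay". The function name df_status and the mapping onto Nagios return codes (0 OK, 1 WARNING, 2 CRITICAL) are illustrative assumptions, not part of collectd or of the Perl plugin.

```shell
#!/bin/sh
# Hypothetical helper: classify a df "used" percentage the way the
# <Plugin "df"> threshold block is assumed to behave (Invert true,
# WarningMin 75, FailureMin 90).
# Prints a Nagios-style return code: 0 = OK, 1 = WARNING, 2 = CRITICAL.
df_status() {
    # awk is used because the value may be fractional (e.g. 89.5)
    awk -v v="$1" 'BEGIN {
        if (v >= 90)      print 2
        else if (v >= 75) print 1
        else              print 0
    }'
}

df_status 42.0   # prints 0
df_status 80     # prints 1
df_status 95.3   # prints 2
```

The same pattern would apply to the "load" block with the bounds 5 and 7.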
nagios_nsca.conf
Code:
<LoadPlugin perl>
  Globals true
</LoadPlugin>
<Plugin perl>
  IncludeDir "/srv/utils/perl"
  BaseName "Collectd::Plugins"
  LoadPlugin nagios_passive
  <Plugin nagios_passive>
    debug 0
    debugDump 0
  </Plugin>
</Plugin>
If you wish, I can share this Perl plugin, but I think the interesting line is:
Code:
system("echo '$passiv' | send_nsca -H NSCA-IP -p NSCA-PORT -c SENT-NSCA-CONFIG");
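That line starts one send_nsca process, and so one connection to the NSCA daemon, for every single result. As far as I know, send_nsca reads one tab-separated result per line from standard input, so several results could be piped to a single call. A minimal sketch under that assumption; the function names and the example host and service names are illustrative, and NSCA-IP, NSCA-PORT and SENT-NSCA-CONFIG are the same placeholders as above:

```shell
#!/bin/sh
# Hypothetical helper: format one passive service check result in the
# tab-separated form send_nsca expects on standard input:
#   host <TAB> service <TAB> return_code <TAB> plugin_output
nsca_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: emit several results as one batch. printf avoids
# the quoting problems of echo '...' when the output contains a quote.
build_batch() {
    nsca_line "web01" "load" 0 "OK - load 0.42"
    nsca_line "web01" "df-root" 1 "WARNING - it's 80% used"
}

# One process and one connection for the whole batch (commented out so
# the sketch runs on a machine without send_nsca):
# build_batch | send_nsca -H NSCA-IP -p NSCA-PORT -c SENT-NSCA-CONFIG
build_batch
```

Whether batching fits depends on how the Perl plugin receives notifications from collectd, so this is only a direction to try, not a drop-in change.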
On the NSCA server (Nagios) edit the nsca.cfg file and enable debugging by changing this from
So I turned debug mode on.
When I filter out all the default messages with the command
Code:
grep -v -e "End of connection" -e "Handling the connection" -e "Attempting to write" -e "Time difference" -e "SERVICE CHECK" -e "HOST CHECK" /var/log/messages
I see many suppressed messages:
Code:
Apr 16 10:00:31 localhost nsca[1772]: Starting up daemon
Apr 16 10:00:31 localhost systemd: Started NSCA for uk cluster.
Apr 16 10:00:35 localhost nagios: job 3273 (pid=1820): read() returned error 11
Apr 16 10:01:01 localhost systemd: Started Session 193 of user root.
Apr 16 10:01:01 localhost journal: Suppressed 5344 messages from /system.slice/nsca_uk.service
Apr 16 10:01:01 localhost journal: Suppressed 1677 messages from /system.slice/nsca_uk.service
Apr 16 10:01:01 localhost journal: Suppressed 9067 messages from /system.slice/nsca_uk.service
Apr 16 10:01:27 localhost nagios: job 3274 (pid=2293): read() returned error 11
Apr 16 10:01:31 localhost journal: Suppressed 5416 messages from /system.slice/nsca_uk.service
Apr 16 10:01:31 localhost journal: Suppressed 1710 messages from /system.slice/nsca_uk.service
Apr 16 10:01:31 localhost journal: Suppressed 9120 messages from /system.slice/nsca_uk.service
Apr 16 10:01:51 localhost nagios: job 3278 (pid=2572): read() returned error 11
Apr 16 10:02:01 localhost journal: Suppressed 8685 messages from /system.slice/nsca_uk.service
Apr 16 10:02:01 localhost journal: Suppressed 5178 messages from /system.slice/nsca_uk.service
Apr 16 10:02:01 localhost journal: Suppressed 1852 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost journal: Suppressed 9009 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost journal: Suppressed 5529 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost journal: Suppressed 1943 messages from /system.slice/nsca_uk.service
Apr 16 10:03:32 localhost rsyslogd: imjournal: 14826 messages lost due to rate-limiting
Apr 16 10:04:02 localhost journal: Suppressed 9349 messages from /system.slice/nsca_uk.service
Apr 16 10:04:02 localhost journal: Suppressed 5659 messages from /system.slice/nsca_uk.service
Apr 16 10:04:02 localhost journal: Suppressed 1963 messages from /system.slice/nsca_uk.service
Apr 16 10:04:11 localhost nagios: job 3280 (pid=3842): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3280 (pid=3841): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3852): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3851): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3868): read() returned error 11
Apr 16 10:04:11 localhost nagios: job 3281 (pid=3873): read() returned error 11
Apr 16 10:04:33 localhost journal: Suppressed 9770 messages from /system.slice/nsca_uk.service
Apr 16 10:04:33 localhost journal: Suppressed 5929 messages from /system.slice/nsca_uk.service
Apr 16 10:04:33 localhost journal: Suppressed 2093 messages from /system.slice/nsca_uk.service
Apr 16 10:05:03 localhost journal: Suppressed 9688 messages from /system.slice/nsca_uk.service
Apr 16 10:05:03 localhost journal: Suppressed 5792 messages from /system.slice/nsca_uk.service
Apr 16 10:05:03 localhost journal: Suppressed 1986 messages from /system.slice/nsca_uk.service
Apr 16 10:13:34 localhost rsyslogd: imjournal: 44375 messages lost due to rate-limiting
Apr 16 10:14:25 localhost nagios: job 3301 (pid=9321): read() returned error 11
Apr 16 10:14:25 localhost nagios: job 3302 (pid=9327): read() returned error 11
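To get a feeling for the volume behind the "Suppressed" lines, the counts can be totalled per service. A minimal sketch, assuming the syslog line layout shown above; the function name sum_suppressed is illustrative:

```shell
#!/bin/sh
# Hypothetical helper: total the "Suppressed N messages from UNIT" counts
# per unit, reading syslog lines from standard input.
sum_suppressed() {
    awk '/journal: Suppressed [0-9]+ messages from/ {
        # layout: ... journal: Suppressed N messages from UNIT
        for (i = 1; i <= NF; i++)
            if ($i == "Suppressed") { total[$(i + 4)] += $(i + 1); break }
    }
    END { for (u in total) print u, total[u] }'
}

# Usage on the real log:
#   sum_suppressed < /var/log/messages
printf '%s\n' \
  'Apr 16 10:01:01 localhost journal: Suppressed 5344 messages from /system.slice/nsca_uk.service' \
  'Apr 16 10:01:01 localhost journal: Suppressed 1677 messages from /system.slice/nsca_uk.service' \
  | sum_suppressed
# prints: /system.slice/nsca_uk.service 7021
```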
Everything seems to work, but NSCA is not able to handle that many connections properly, especially when a v2.9 client is installed somewhere.
Do you have any idea what to check next to get a working solution?
Thank you for your effort.