Nagios Log Server listening port abruptly halts v2

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Nagios Log Server listening port abruptly halts v2

Post by cdienger »

Thanks for the update! Crossing my fingers :)
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

No dice.

It died Sunday morning. Logs from yesterday attached.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Nagios Log Server listening port abruptly halts v2

Post by cdienger »

The good news is that I don't see any memory problems, so we've at least helped things a bit.

We do see the same error messages that Matt pointed out earlier:

{:timestamp=>"2017-05-20T22:40:19.003000+0200", :message=>"Got error to send bulk of actions: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-05-20T22:40:19.003000+0200", :message=>"Failed to flush outgoing items", :outgoing_count=>104, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: ...


And additionally:

Code:

{:timestamp=>"2017-05-20T18:13:32.058000+0200", :message=>"An error occurred. Closing connection", :client=>"172.17.17.1:49809", :exception=>#<LogStash::ShutdownSignal: LogStash::ShutdownSignal>...
and:

Code:

{:timestamp=>"2017-05-20T22:40:09.370000+0200", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
How many clients are sending logs to NLS? How many instances of LS are there? Let's get the Logstash and Elasticsearch logs from all instances as well as System Profiles (Administration > System > System Status).
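In the meantime, a rough sketch of what we're after from each instance (the curl check and the log paths below are the usual defaults on a Log Server install, so adjust them if your layout differs):

Code:

# Quick sanity check that Elasticsearch is reachable on this instance
# ("None of the configured nodes are available" usually means it isn't).
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Bundle the Logstash and Elasticsearch logs from this instance.
# /var/log/logstash and /var/log/elasticsearch are the usual default
# locations; adjust if your install keeps them elsewhere.
tar czf /tmp/nls-logs-$(hostname).tar.gz /var/log/logstash /var/log/elasticsearch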
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

I have a pair of log servers (in a cluster) logging the nodes local to their location at each DC.

However, this issue is only occurring on one DC. The rest are not affected.

I have around 90 nodes logging to NLS, including both network/security devices and Windows/Linux hosts.

This port (3515) is used exclusively by our Windows hosts.

EDIT: Based on monitoring, it's happening every 5 days.
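For reference, a quick way to confirm whether the listener is still bound the next time it drops (a rough sketch; flags may vary slightly by distro):

Code:

# Is anything still listening on the Windows syslog input port?
ss -lntup | grep 3515

# Is the Logstash process itself still running?
ps -ef | grep '[l]ogstash'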
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Nagios Log Server listening port abruptly halts v2

Post by avandemore »

What is the output of the following on NLS:

Code:

# top -bcn1
Previous Nagios employee
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

avandemore wrote: What is the output of the following on NLS:

Code:

# top -bcn1


[root@hs1-log-01 ~]# top -bcn1
top - 03:24:59 up 23 days, 16:11, 1 user, load average: 0.08, 0.07, 0.05
Tasks: 134 total, 1 running, 133 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.3 us, 0.3 sy, 0.8 ni, 97.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16268312 total, 292148 free, 10130960 used, 5845204 buff/cache
KiB Swap: 7815164 total, 7768884 free, 46280 used. 4787584 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 190844 3268 2100 S 0.0 0.0 19:18.50 /usr/lib/systemd/systemd --switched-root --system --deserialize 20
2 root 20 0 0 0 0 S 0.0 0.0 0:00.29 [kthreadd]
3 root 20 0 0 0 0 S 0.0 0.0 0:03.20 [ksoftirqd/0]
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kworker/0:0H]
7 root rt 0 0 0 0 S 0.0 0.0 0:07.96 [migration/0]
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [rcu_bh]
9 root 20 0 0 0 0 S 0.0 0.0 4:36.19 [rcu_sched]
10 root rt 0 0 0 0 S 0.0 0.0 0:03.53 [watchdog/0]
11 root rt 0 0 0 0 S 0.0 0.0 0:02.99 [watchdog/1]
12 root rt 0 0 0 0 S 0.0 0.0 0:07.59 [migration/1]
13 root 20 0 0 0 0 S 0.0 0.0 0:08.38 [ksoftirqd/1]
15 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kworker/1:0H]
16 root rt 0 0 0 0 S 0.0 0.0 0:03.51 [watchdog/2]
17 root rt 0 0 0 0 S 0.0 0.0 0:07.49 [migration/2]
18 root 20 0 0 0 0 S 0.0 0.0 0:09.75 [ksoftirqd/2]
20 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kworker/2:0H]
21 root rt 0 0 0 0 S 0.0 0.0 0:03.55 [watchdog/3]
22 root rt 0 0 0 0 S 0.0 0.0 0:07.21 [migration/3]
23 root 20 0 0 0 0 S 0.0 0.0 0:18.46 [ksoftirqd/3]
25 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kworker/3:0H]
27 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [khelper]
28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kdevtmpfs]
29 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [netns]
30 root 20 0 0 0 0 S 0.0 0.0 0:03.69 [khungtaskd]
31 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [writeback]
32 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kintegrityd]
33 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [bioset]
34 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kblockd]
35 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [md]
41 root 20 0 0 0 0 S 0.0 0.0 0:31.72 [kswapd0]
42 root 25 5 0 0 0 S 0.0 0.0 0:00.00 [ksmd]
43 root 39 19 0 0 0 S 0.0 0.0 0:40.78 [khugepaged]
44 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [fsnotify_mark]
45 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [crypto]
53 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kthrotld]
56 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kmpath_rdacd]
57 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kpsmoused]
59 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [ipv6_addrconf]
78 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [deferwq]
112 root 20 0 0 0 0 S 0.0 0.0 0:20.54 [kauditd]
298 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [ata_sff]
299 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [scsi_eh_0]
300 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [scsi_tmf_0]
301 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [vmw_pvscsi_wq_0]
304 root 20 0 0 0 0 S 0.0 0.0 0:00.01 [scsi_eh_1]
305 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [scsi_tmf_1]
306 root 20 0 0 0 0 S 0.0 0.0 0:00.01 [scsi_eh_2]
307 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [scsi_tmf_2]
311 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [ttm_swap]
375 root 0 -20 0 0 0 S 0.0 0.0 0:00.31 [kworker/2:1H]
382 root 0 -20 0 0 0 S 0.0 0.0 0:00.87 [kworker/3:1H]
418 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kdmflush]
419 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [bioset]
428 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kdmflush]
429 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [bioset]
433 root 0 -20 0 0 0 S 0.0 0.0 0:00.51 [kworker/1:1H]
447 root 20 0 0 0 0 S 0.0 0.0 0:00.19 [jbd2/dm-0-8]
448 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [ext4-rsv-conver]
523 root 20 0 77776 39536 39384 S 0.0 0.2 8:30.34 /usr/lib/systemd/systemd-journald
538 root 20 0 274552 752 752 S 0.0 0.0 0:00.01 /usr/sbin/lvmetad -f
539 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [rpciod]
548 root 20 0 43524 988 984 S 0.0 0.0 0:00.07 /usr/lib/systemd/systemd-udevd
638 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [jbd2/sda1-8]
639 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [ext4-rsv-conver]
641 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kdmflush]
642 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [bioset]
643 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kdmflush]
645 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [bioset]
658 root 0 -20 0 0 0 S 0.0 0.0 0:01.57 [kworker/0:1H]
665 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [jbd2/dm-3-8]
666 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [ext4-rsv-conver]
668 root 20 0 0 0 0 S 0.0 0.0 3:03.42 [jbd2/dm-2-8]
669 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [ext4-rsv-conver]
673 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kdmflush]
674 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [bioset]
675 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kdmflush]
676 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [bioset]
697 root 20 0 0 0 0 S 0.0 0.0 0:11.01 [jbd2/dm-5-8]
698 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [ext4-rsv-conver]
700 root 20 0 0 0 0 S 0.0 0.0 0:25.66 [jbd2/dm-4-8]
701 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [ext4-rsv-conver]
721 root 16 -4 55416 1440 1316 S 0.0 0.0 0:42.88 /sbin/auditd -n
744 dbus 20 0 24744 1416 1184 S 0.0 0.0 2:58.20 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
746 chrony 20 0 115848 1328 1232 S 0.0 0.0 0:00.94 /usr/sbin/chronyd
748 root 20 0 435412 4304 3716 S 0.0 0.0 0:31.61 /usr/sbin/NetworkManager --no-daemon
749 root 20 0 24304 1516 1304 S 0.0 0.0 1:40.93 /usr/lib/systemd/systemd-logind
751 root 20 0 19312 1080 920 S 0.0 0.0 0:39.43 /usr/sbin/irqbalance --foreground
755 root 20 0 308264 5380 3604 S 0.0 0.0 13:35.55 /usr/bin/vmtoolsd
758 polkitd 20 0 527640 3752 3340 S 0.0 0.0 1:01.18 /usr/lib/polkit-1/polkitd --no-debug
767 root 20 0 201212 616 616 S 0.0 0.0 0:01.60 /usr/sbin/gssproxy -D
1012 root 20 0 833052 23924 23384 S 0.0 0.1 15:59.85 /usr/sbin/rsyslogd -n
1015 root 20 0 113340 552 468 S 0.0 0.0 0:00.03 /usr/bin/rhsmcertd
1017 root 20 0 553152 4844 4400 S 0.0 0.0 2:10.20 /usr/bin/python -Es /usr/sbin/tuned -l -P
1049 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [cifsiod]
1053 root 20 0 82468 556 488 S 0.0 0.0 0:00.00 /usr/sbin/sshd
1055 root 20 0 111704 312 244 S 0.0 0.0 0:45.66 /usr/sbin/keepalived -D
1056 root 20 0 111704 1212 1152 S 0.0 0.0 0:48.46 /usr/sbin/keepalived -D
1057 root 20 0 111704 392 332 S 0.0 0.0 1:48.36 /usr/sbin/keepalived -D
1086 root 20 0 108420 920 688 S 0.0 0.0 0:26.73 sendmail: accepting connections
1104 smmsp 20 0 103836 556 420 S 0.0 0.0 0:00.20 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
1106 root 20 0 0 0 0 S 0.0 0.0 0:12.51 [cifsd]
1114 root 20 0 319472 12556 7660 S 0.0 0.1 0:45.35 /usr/sbin/httpd -DFOREGROUND
1122 root 20 0 126224 1100 960 S 0.0 0.0 0:13.53 /usr/sbin/crond -n
1123 root 20 0 25844 748 732 S 0.0 0.0 0:00.00 /usr/sbin/atd -f
1138 root 20 0 107912 532 416 S 0.0 0.0 0:00.01 rhnsd
1172 root 20 0 110036 684 680 S 0.0 0.0 0:00.00 /sbin/agetty --noclear tty1 linux
1462 root 20 0 239988 17412 1836 S 0.0 0.1 1:01.50 /usr/local/ncpa/ncpa_posix_passive --start
1472 root 20 0 346128 21104 3228 S 0.0 0.1 2:14.81 ./ncpa_posix_listener --start
9295 root 20 0 0 0 0 S 0.0 0.0 0:00.07 [kworker/1:0]
10610 root 39 19 144248 1524 1200 S 0.0 0.0 0:00.00 runuser -s /bin/sh -c exec /usr/local/nagioslogserver/logstash/bin/logstash agent -f /usr+
10612 nagios 39 19 10.062g 752080 14660 S 0.0 4.6 222:54.30 java -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Djava.io.tmpdir=/usr/local/nagioslo+
10727 nagios 20 0 22.541g 8.479g 139292 S 0.0 54.7 302:42.35 java -Xms7943m -Xmx7943m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepG+
14404 root 20 0 0 0 0 S 0.0 0.0 0:00.15 [kworker/3:2]
18190 root 20 0 0 0 0 S 0.0 0.0 0:00.04 [kworker/2:0]
19022 root 20 0 0 0 0 S 0.0 0.0 0:05.05 [kworker/u8:0]
20655 root 20 0 0 0 0 S 0.0 0.0 0:01.35 [kworker/0:0]
21502 apache 20 0 323352 13652 4228 S 0.0 0.1 0:00.14 /usr/sbin/httpd -DFOREGROUND
21503 apache 20 0 323352 13652 4228 S 0.0 0.1 0:00.17 /usr/sbin/httpd -DFOREGROUND
21504 apache 20 0 323352 13652 4228 S 0.0 0.1 0:00.15 /usr/sbin/httpd -DFOREGROUND
21505 apache 20 0 323352 13652 4228 S 0.0 0.1 0:00.20 /usr/sbin/httpd -DFOREGROUND
21506 apache 20 0 323352 13652 4228 S 0.0 0.1 0:00.17 /usr/sbin/httpd -DFOREGROUND
22774 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/3:0]
22962 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/u8:2]
24370 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/1:1]
25988 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/2:1]
26170 root 20 0 123196 736 504 S 0.0 0.0 0:00.00 /usr/sbin/anacron -s
29377 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/0:2]
30104 root 20 0 178120 2068 1552 S 0.0 0.0 0:00.00 /usr/sbin/CROND -n
30105 nagios 20 0 113120 1208 1028 S 0.0 0.0 0:00.00 /bin/sh -c /usr/bin/php -q /var/www/html/nagioslogserver/www/index.php poller > /usr/loca+
30107 nagios 20 0 242152 14164 7056 S 0.0 0.1 0:00.03 /usr/bin/php -q /var/www/html/nagioslogserver/www/index.php poller
30222 root 20 0 149992 6044 4684 S 0.0 0.0 0:00.02 sshd: root@pts/0
30226 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [kworker/0:1]
30265 root 20 0 115516 2096 1620 S 0.0 0.0 0:00.00 -bash
30283 root 20 0 157600 2160 1576 R 0.0 0.0 0:00.00 top -bcn1
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

Looks like it died again this morning. Nagios XI didn't trigger any listening port errors, so the port is up, but NLS isn't capturing logs on port 3515.

Highly odd.
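A rough way to tell whether the traffic is still reaching the box at all, versus Logstash accepting connections but not processing them (the interface name below is a guess, so swap it for the right one):

Code:

# Is syslog traffic from the Windows hosts still arriving on port 3515?
# Replace eth0 with the correct interface for this box.
tcpdump -ni eth0 port 3515 -c 20

# Compare with what Logstash has written most recently.
tail -n 50 /var/log/logstash/logstash.log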
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Nagios Log Server listening port abruptly halts v2

Post by mcapra »

Every time this happens, it's super duper helpful to have quite literally all available Logstash and Elasticsearch logs. From each instance, from each available day, etc. There could be something happening internally in Elasticsearch that's either not present in the Logstash log, or something that happened several hours/days prior that set Logstash up for failure down the road.

I recall chasing an error in Logstash that was the result of irresponsible connection termination, creating a situation where lots of stale connections were left open, which made the operating system on many machines upset:
https://github.com/elastic/logstash/issues/4815
https://github.com/elastic/logstash/issues/4225

I don't know if this is the exact problem you're encountering, but it might be worth trying rolling restarts of the Logstash service on your machines on a daily basis (nothing more sophisticated than a cron job) with the intention of purging these stale connections.
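Something along these lines is all I have in mind (a minimal sketch, assuming the stock logstash service name; the time and log path are arbitrary, and you'd want to stagger the hour across instances):

Code:

# /etc/cron.d/logstash-restart (example only)
# Restart Logstash at 03:30 every day to clear any stale connections.
30 3 * * * root /sbin/service logstash restart >> /var/log/logstash-restart.log 2>&1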

You might also try adjusting the LS_OPEN_FILES limit as well as your system's open file limit. You'll usually see different exceptions thrown when it's a system file limit issue, but it's worth a shot. Per this article:
https://support.nagios.com/kb/article/n ... dying.html
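For the file-limit angle, a quick sketch of what to look at (the pgrep pattern assumes the default install path, and the sysconfig location should be verified on your boxes):

Code:

# How many file descriptors is Logstash actually using, and what is its limit?
for pid in $(pgrep -f nagioslogserver/logstash); do
    echo "PID $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) open descriptors"
    grep 'Max open files' /proc/$pid/limits
done

# LS_OPEN_FILES is normally set in /etc/sysconfig/logstash; verify the path
# and current value on your install before changing it.
grep LS_OPEN_FILES /etc/sysconfig/logstash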
Former Nagios employee
https://www.mcapra.com/
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios Log Server listening port abruptly halts v2

Post by tmcdonald »

Thanks for the post, @mcapra! OP, let us know if you have further questions.
Former Nagios employee
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

mcapra wrote: Every time this happens, it's super duper helpful to have quite literally all available Logstash and Elasticsearch logs. From each instance, from each available day, etc. There could be something happening internally in Elasticsearch that's either not present in the Logstash log, or something that happened several hours/days prior that set Logstash up for failure down the road.

I recall chasing an error in Logstash that was the result of irresponsible connection termination, creating a situation where lots of stale connections were left open, which made the operating system on many machines upset:
https://github.com/elastic/logstash/issues/4815
https://github.com/elastic/logstash/issues/4225

I don't know if this is the exact problem you're encountering, but it might be worth trying rolling restarts of the Logstash service on your machines on a daily basis (nothing more sophisticated than a cron job) with the intention of purging these stale connections.

You might also try adjusting the LS_OPEN_FILES limit as well as your system's open file limit. You'll usually see different exceptions thrown when it's a system file limit issue, but it's worth a shot. Per this article:
https://support.nagios.com/kb/article/n ... dying.html
I've tried running a cron job to restart the services, but they always seem to fail AFTER being scheduled. Quite likely I've made a mistake somewhere in the cron job; I'll look into this again.

What I've noticed is that the failure happens every 5 days or so; if I look at the CPU/MEM utilization graph, RAM/CPU usage always drops after a restart.
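In case it helps narrow down the cron side, a couple of quick checks (assuming a RHEL/CentOS layout, and that the job restarts the stock logstash service):

Code:

# Did cron actually fire the restart job? (/var/log/cron on RHEL/CentOS)
grep -i logstash /var/log/cron | tail -n 5

# And is the service actually up afterwards?
service logstash status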