Re: Check did not exit properly / Failed to register iobroke
Posted: Thu Sep 07, 2017 8:47 am
I'm afraid I can't discover a pattern of any kind. I've been chewing on this for over a week now.
What I noticed is that Nagios has to be running for a few hours before a major failure occurs. I have done some juggling with the logfiles of the last few days. The result is that checks that get executed more often tend to fail more often. For example, the bandwidth checks, which are executed every minute, fail ten times more often than checks that are executed every 10 minutes. To be honest, I don't think the problem is in the checks, also because those same checks are functioning well on our old Nagios 3 box.
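A tally like this can be normalized by how often each check is scheduled, which shows whether frequent checks fail disproportionately or just proportionally. The following is a minimal Python sketch under stated assumptions, not the exact analysis behind the numbers above: it counts the "did not exit properly" warnings per service and divides by the expected number of runs. The function names, the `intervals_minutes` mapping and the interval values are hypothetical.

```python
import re
from collections import Counter

# Matches the warning Nagios logs when a service check does not exit properly.
WARNING_RE = re.compile(
    r"Warning: Check of service '(?P<service>[^']+)' "
    r"on host '(?P<host>[^']+)' did not exit properly"
)


def count_failures(lines):
    """Return a Counter of failed checks keyed by service description."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group("service")] += 1
    return counts


def failure_rate(counts, intervals_minutes, window_minutes):
    """Failures divided by the expected number of runs in the window.

    intervals_minutes is a hypothetical mapping of service -> check interval;
    services without a known interval are skipped.
    """
    rates = {}
    for service, failures in counts.items():
        interval = intervals_minutes.get(service)
        if interval:
            expected_runs = window_minutes / interval
            rates[service] = failures / expected_runs
    return rates


if __name__ == "__main__":
    sample = [
        "[1504790104] Warning: Check of service 'check_time_offset' "
        "on host 'vm-vcenter6' did not exit properly!",
        "[1504790104] Warning: Check of service 'check_disks_snmp' "
        "on host 'vm-vibe-idx1' did not exit properly!",
    ]
    counts = count_failures(sample)
    # Interval values below are made up for illustration only.
    intervals = {"check_time_offset": 10, "check_disks_snmp": 1}
    print(failure_rate(counts, intervals, window_minutes=60))
```

If the rates come out roughly equal across services, a tenfold difference in raw failure counts between a one-minute check and a ten-minute check is what random failures would produce, which would fit the suspicion that the checks themselves are not at fault.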
The virtual server runs on VMware ESX 6.0.0.5572656
iowait is nicely low, maybe because the ESX host is running on SSDs.
While tailing the logfile I just witnessed this, all in one second.
Code: Select all
uname -a: Linux vm-nagios 4.4.79-19-default #1 SMP Thu Aug 10 20:28:47 UTC 2017 (2dd03e8) x86_64 x86_64 x86_64 GNU/Linux
OS: openSUSE Leap 42.3
CPUs: 4
RAM: 4 GB
MemTotal: 4021572 kB
MemFree: 1613088 kB
MemAvailable: 3234260 kB
Code: Select all
top - 15:32:02 up 4:24, 3 users, load average: 3.06, 3.02, 3.17
Tasks: 307 total, 4 running, 303 sleeping, 0 stopped, 0 zombie
%Cpu0 : 41.9 us, 30.6 sy, 0.0 ni, 27.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 46.1 us, 27.6 sy, 0.0 ni, 25.9 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu2 : 43.6 us, 29.9 sy, 0.0 ni, 26.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 52.3 us, 22.7 sy, 0.0 ni, 25.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 4021572 total, 2385472 used, 1636100 free, 2108 buffers
KiB Swap: 0 total, 0 used, 0 free. 1870824 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
60058 nagios 20 0 34356 26412 2228 S 2.990 0.657 1:14.45 nagios
55375 nagios 20 0 11912 2724 2320 S 2.658 0.068 0:00.36 check_snmp_cisc
71803 nagios 20 0 11916 2716 2340 S 1.329 0.068 0:00.04 check_snmp_cisc
751 root 20 0 12024 4788 1544 S 0.664 0.119 1:16.93 haveged
1380 message+ 20 0 41016 4996 3540 S 0.664 0.124 0:24.86 dbus-daemon
3160 gdm 20 0 1610392 153208 76088 S 0.664 3.810 0:29.23 gnome-shell
72885 nagios 20 0 19080 5660 4304 R 0.664 0.141 0:00.02 snmpget
Note 1: The complete one-second fragment is in the attachment.
Note 2: The human-readable date is added by my tail command; it's not in the actual logfile.
Code: Select all
2017-09-07 15:15:04 [1504790104] HOST ALERT: CORE-BB;DOWN;SOFT;1;(Host check did not exit properly)
2017-09-07 15:15:04 [1504790104] Warning: Check of service 'check_time_offset' on host 'vm-vcenter6' did not exit properly!
2017-09-07 15:15:04 [1504790104] Warning: Check of service 'check_disks_snmp' on host 'vm-vibe-idx1' did not exit properly!
2017-09-07 15:15:04 [1504790104] SERVICE ALERT: vm-vibe-idx1;check_disks_snmp;CRITICAL;SOFT;1;(Service check did not exit properly)
2017-09-07 15:15:04 [1504790104] wproc: Core Worker 5614: Failed to register iobroker for stdout
2017-09-07 15:15:04 [1504790104] wproc: Core Worker 5614: Failed to register iobroker for stderr
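For anyone who wants the same human-readable prefix on their own logfile, here is a minimal Python sketch, assuming each Nagios log line starts with an epoch timestamp in square brackets as in the fragment above. It is not the tail command that produced the output above; the function name `add_readable_time` is hypothetical, and the UTC+2 offset in the usage is an assumption based on the timestamps in the sample.

```python
import re
from datetime import datetime, timedelta, timezone

# Nagios log lines start with the epoch time in square brackets.
EPOCH_RE = re.compile(r"^\[(\d+)\]")


def add_readable_time(line, tz=None):
    """Prefix a log line with a readable date derived from its epoch stamp.

    With tz=None the system's local timezone is used. Lines that do not
    start with an epoch stamp are returned unchanged.
    """
    match = EPOCH_RE.match(line)
    if not match:
        return line
    stamp = datetime.fromtimestamp(int(match.group(1)), tz=tz)
    return f"{stamp:%Y-%m-%d %H:%M:%S} {line}"


if __name__ == "__main__":
    # Assumed offset: UTC+2, matching the dates shown in the log fragment.
    cest = timezone(timedelta(hours=2))
    raw = "[1504790104] wproc: Core Worker 5614: Failed to register iobroker for stdout"
    print(add_readable_time(raw, tz=cest))
```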