Nagios Support Forum

Posted: **Thu Sep 07, 2017 8:47 am**

I'm affraid I can't discover a pattern of any kind. I've been chewing on this for over a week now.
What I noticed is that Nagios has to be running for a few hours before a major failure occurs. I have done some juggling with the logfiles of last few days. The result is that checks that get executed more tend to fail more. Like those bandwith checks, that are executed every minute, fail ten times more often than checks that are executed every 10 minutes. To be honest, I don't think it's in the checks, also because those same checks a functioning well on our old Nagios 3 box.

The virtual server runs on VMWare ESX 6.0.0.5572656

Code: Select all

uname -a: 	Linux vm-nagios 4.4.79-19-default #1 SMP Thu Aug 10 20:28:47 UTC 2017 (2dd03e8) x86_64 x86_64 x86_64 GNU/Linux
OS: 	 	  openSUSE Leap 42.3
CPUs: 		 4
RAM: 		  4 GB
  MemTotal:        4021572 kB
  MemFree:         1613088 kB
  MemAvailable:    3234260 kB

iowait is nicely low, maybe because the ESX is running on SSD's.

Code: Select all

top - 15:32:02 up  4:24,  3 users,  load average: 3.06, 3.02, 3.17
Tasks: 307 total,   4 running, 303 sleeping,   0 stopped,   0 zombie
%Cpu0  : 41.9 us, 30.6 sy,  0.0 ni, 27.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 46.1 us, 27.6 sy,  0.0 ni, 25.9 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu2  : 43.6 us, 29.9 sy,  0.0 ni, 26.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 52.3 us, 22.7 sy,  0.0 ni, 25.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   4021572 total,  2385472 used,  1636100 free,     2108 buffers
KiB Swap:        0 total,        0 used,        0 free.  1870824 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 60058 nagios    20   0   34356  26412   2228 S 2.990 0.657   1:14.45 nagios
 55375 nagios    20   0   11912   2724   2320 S 2.658 0.068   0:00.36 check_snmp_cisc
 71803 nagios    20   0   11916   2716   2340 S 1.329 0.068   0:00.04 check_snmp_cisc
   751 root      20   0   12024   4788   1544 S 0.664 0.119   1:16.93 haveged
  1380 message+  20   0   41016   4996   3540 S 0.664 0.124   0:24.86 dbus-daemon
  3160 gdm       20   0 1610392 153208  76088 S 0.664 3.810   0:29.23 gnome-shell
 72885 nagios    20   0   19080   5660   4304 R 0.664 0.141   0:00.02 snmpget

While tailing the logfile I just witnessed this, all in one second.
Note 1: The complete one-second-fragment is in the attachement.
Note 2: The human readable date is part of my tail command, it's not in the actual logfile.

Code: Select all

2017-09-07 15:15:04 [1504790104] HOST ALERT: CORE-BB;DOWN;SOFT;1;(Host check did not exit properly)
2017-09-07 15:15:04 [1504790104] Warning: Check of service 'check_time_offset' on host 'vm-vcenter6' did not exit properly!
2017-09-07 15:15:04 [1504790104] Warning: Check of service 'check_disks_snmp' on host 'vm-vibe-idx1' did not exit properly!
2017-09-07 15:15:04 [1504790104] SERVICE ALERT: vm-vibe-idx1;check_disks_snmp;CRITICAL;SOFT;1;(Service check did not exit properly)
2017-09-07 15:15:04 [1504790104] wproc: Core Worker 5614: Failed to register iobroker for stdout
2017-09-07 15:15:04 [1504790104] wproc: Core Worker 5614: Failed to register iobroker for stderr

Posted: **Thu Sep 07, 2017 11:01 am**

I see in your cfg that debug is set to host/service - this may actually be a problem with the workers. Can you set the debug_level to -1 and send me what's generated?

Before you do that, I'd like to see what's in your debug log right this second - and then truncate it - change the options - restart nagios and send the new one over.

Thanks!

Posted: **Fri Sep 08, 2017 4:24 am**

I've set debug to -1.

This is a fragment of the old setting:

Code: Select all

2017-09-08  9:15:05 [1504854905.450313] [016.0] [pid=60058] Attempting to run scheduled check of host 'Eduarte [helicon.educus.nl]': check options=0, latency=0.000000
2017-09-08  9:15:05 [1504854905.450322] [016.0] [pid=60058] archives nagios.configtest nagios.debug nagios.log objects.cache objects.precache retention.dat rw spool tmp Running async check of host 'Eduarte [helicon.educus.nl]'...
2017-09-08  9:15:05 [1504854905.450327] [016.0] [pid=60058] Host 'Eduarte [helicon.educus.nl]' passed first hurdle (caching/execution)
2017-09-08  9:15:05 [1504854905.450338] [016.0] [pid=60058] Checking host 'Eduarte [helicon.educus.nl]'...
2017-09-08  9:15:05 [1504854905.453453] [016.0] [pid=60058] Attempting to run scheduled check of host 'SW-MB-wan [Den Bosch]': check options=0, latency=0.000000
2017-09-08  9:15:05 [1504854905.453475] [016.0] [pid=60058] archives nagios.configtest nagios.debug nagios.log objects.cache objects.precache retention.dat rw spool tmp Running async check of host 'SW-MB-wan [Den Bosch]'...
2017-09-08  9:15:05 [1504854905.453481] [016.0] [pid=60058] Host 'SW-MB-wan [Den Bosch]' passed first hurdle (caching/execution)
2017-09-08  9:15:05 [1504854905.453504] [016.0] [pid=60058] Checking host 'SW-MB-wan [Den Bosch]'...
2017-09-08  9:15:05 [1504854905.454611] [016.0] [pid=60058] Attempting to run scheduled check of host 'SW-MT-WAN': check options=0, latency=0.000000
2017-09-08  9:15:05 [1504854905.454626] [016.0] [pid=60058] archives nagios.configtest nagios.debug nagios.log objects.cache objects.precache retention.dat rw spool tmp Running async check of host 'SW-MT-WAN'...
2017-09-08  9:15:05 [1504854905.454632] [016.0] [pid=60058] Host 'SW-MT-WAN' passed first hurdle (caching/execution)
2017-09-08  9:15:05 [1504854905.454644] [016.0] [pid=60058] Checking host 'SW-MT-WAN'...
2017-09-08  9:15:05 [1504854905.461762] [016.0] [pid=60058] Attempting to run scheduled check of service 'check_disks_snmp' on host 'vm-vibe-tst-web1': check options=0, latency=0.000000
2017-09-08  9:15:05 [1504854905.461796] [016.0] [pid=60058] Checking service 'check_disks_snmp' on host 'vm-vibe-tst-web1'...
2017-09-08  9:15:05 [1504854905.480943] [016.0] [pid=60058] Attempting to run scheduled check of host 'SW-SRA-03': check options=0, latency=0.000000
2017-09-08  9:15:05 [1504854905.480977] [016.0] [pid=60058] archives nagios.configtest nagios.debug nagios.log objects.cache objects.precache retention.dat rw spool tmp Running async check of host 'SW-SRA-03'...
2017-09-08  9:15:05 [1504854905.480984] [016.0] [pid=60058] Host 'SW-SRA-03' passed first hurdle (caching/execution)
2017-09-08  9:15:05 [1504854905.481001] [016.0] [pid=60058] Checking host 'SW-SRA-03'...
2017-09-08  9:15:05 [1504854905.658266] [016.0] [pid=60058] Attempting to run scheduled check of host 'vm-cardaccess2': check options=0, latency=0.000000
2017-09-08  9:15:05 [1504854905.658296] [016.0] [pid=60058] archives nagios.configtest nagios.debug nagios.log objects.cache objects.precache retention.dat rw spool tmp Running async check of host 'vm-cardaccess2'...
2017-09-08  9:15:05 [1504854905.658302] [016.0] [pid=60058] Host 'vm-cardaccess2' passed first hurdle (caching/execution)
2017-09-08  9:15:05 [1504854905.658318] [016.0] [pid=60058] Checking host 'vm-cardaccess2'...
2017-09-08  9:15:05 [1504854905.957703] [016.0] [pid=60058] Attempting to run scheduled check of service 'check_memory' on host 'vm-vibe-web1': check options=0, latency=0.000000
2017-09-08  9:15:05 [1504854905.957788] [016.0] [pid=60058] Checking service 'check_memory' on host 'vm-vibe-web1'...
2017-09-08  9:15:06 [1504854906.020347] [016.0] [pid=60058] archives nagios.configtest nagios.debug nagios.log objects.cache objects.precache retention.dat rw spool tmp Handling check result for service 'check_memory' on host 'vm-vibe-web1' from 'Core Worker 5608'...
2017-09-08  9:15:06 [1504854906.020420] [016.0] [pid=60058] Scheduling a non-forced, active check of service 'check_memory' on host 'vm-vibe-web1' @ Fri Sep 8 09:25:05 2017
2017-09-08  9:15:06 [1504854906.031555] [016.0] [pid=60058] archives nagios.configtest nagios.debug nagios.log objects.cache objects.precache retention.dat rw spool tmp Handling check result for service 'check_disks_snmp' on host 'vm-vibe-tst-web1' from 'Core Worker 5622'...
2017-09-08  9:15:06 [1504854906.031623] [016.0] [pid=60058] Scheduling a non-forced, active check of service 'check_disks_snmp' on host 'vm-vibe-tst-web1' @ Fri Sep 8 09:25:05 2017
2017-09-08  9:15:06 [1504854906.123828] [016.0] [pid=60058] Attempting to run scheduled check of service 'check_interface_utilisation' on host 'vt-idm16b': check options=0, latency=0.000000
2017-09-08  9:15:06 [1504854906.123869] [016.0] [pid=60058] Checking service 'check_interface_utilisation' on host 'vt-idm16b'...
2017-09-08  9:15:06 [1504854906.151757] [016.0] [pid=60058] archives nagios.configtest nagios.debug nagios.log objects.cache objects.precache retention.dat rw spool tmp Handling check result for service 'check_interface_utilisation' on host 'vt-idm16b' from 'Core Worker 5620'...
2017-09-08  9:15:06 [1504854906.151832] [016.0] [pid=60058] Scheduling a non-forced, active check of service 'check_interface_utilisation' on host 'vt-idm16b' @ Fri Sep 8 09:25:06 2017
2017-09-08  9:15:06 [1504854906.154552] [016.0] [pid=60058] archives nagios.configtest nagios.debug nagios.log objects.cache objects.precache retention.dat rw spool tmp Handling check result for service 'ports_bandwith' on host 'PA1' from 'Core Worker 5616'...
2017-09-08  9:15:06 [1504854906.154639] [016.0] [pid=60058] Scheduling a non-forced, active check of service 'ports_bandwith' on host 'PA1' @ Fri Sep 8 09:15:57 2017
2017-09-08  9:15:06 [1504854906.180059] [016.0] [pid=60058] Scheduling a non-forced, active check of host 'vm-vibe-tst-web1' @ Fri Sep 8 09:25:06 2017
2017-09-08  9:15:06 [1504854906.188075] [016.0] [pid=60058] Scheduling a non-forced, active check of host 'vt-idm16b' @ Fri Sep 8 09:25:06 2017
2017-09-08  9:15:06 [1504854906.521553] [016.0] [pid=60058] Attempting to run scheduled check of service 'check_win_available_updates' on host 'vm-zcmdb': check options=0, latency=0.000000

I have started nagios with the debug_level to -1 setting, let it run for 120 seconds. Stop it. The nagios.debug is in the attachement.

Posted: **Fri Sep 08, 2017 1:24 pm**

So, I was experiencing the same problem yesterday after some Nagios component updates (although not Nagios Core) and a significant addition of new hosts/services. Nagios is also running in a container.

After reading through this thread and looking at our metrics. I am pretty sure that the CPU was being throttled based on the cgroup quotas when this was occurring and it looks like there was also a memory based OOM event. I've redeployed the Nagios container with more CPU and memory (due to the increased checks) and so far it seems better, but it hasn't soaked for very long yet.

Posted: **Fri Sep 08, 2017 3:18 pm**

@alexpeeters, was the new information useful for you? Are we ready to close the ticket?

Posted: **Fri Sep 08, 2017 8:43 pm**

Here's just a single example, and it may be nothing, but what is the check command and the correlating command definition of the host/service in the following debug warnings?

Code: Select all

[1504862182.959760] [016.0] [pid=49732] Checking service 'Gi0/3 sw-mh-03' on host 'SW-MH-WAN'...
[1504862182.959764] [001.0] [pid=49732] get_raw_command_line_r()
[1504862182.959773] [001.0] [pid=49732] process_macros_r()
[1504862182.959778] [001.0] [pid=49732] process_macros_r()
[1504862182.959783] [2048.0] [pid=49732]  WARNING: Could not find a macro matching ''!
[1504862182.959787] [2048.0] [pid=49732]  WARNING: An error occurred processing macro ''!

Posted: **Mon Sep 11, 2017 3:06 am**

@dwhitfield not yet, I would like to see if the tips provided actually resolve the issue.

@spkane that's great news, I'll double the resources on the VM and see what happens. Although I'm not sure if that would improve our problem since the CPU load hovers around 35-40% according to VMWare. And the memory usage hovers between 30 and 40%. I'm noting a CPU Ready time in ESX that hovers between 200 and 600 ms, increasing the ammount of CPU's will likely increase the ready time, that isn't necessarily a bad thing for the VM, but my VMWare admin won't like it. I will report my findings.

In the meantime I have changed base/checks.c, in function static int get_host_check_return_code(host *temp_host, check_result *queued_check_result) I have added a sleep(20);

Since the comment states that the code below should never be reached I thought it would'n harm to slow things down when the error would occur.

Code: Select all

                /* if there was some error running the command, just skip it (this shouldn't be happening) */
                else if(queued_check_result->exited_ok == FALSE) {

                        logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning:  Check of host '%s' did not exit properly!\n", temp_host->name);

                        my_free(temp_host->plugin_output);
                        my_free(temp_host->long_plugin_output);
                        my_free(temp_host->perf_data);

                        temp_host->plugin_output = (char *)strdup("(Host check did not exit properly.)");

                        rc = STATE_CRITICAL;
                        sleep(20);
                        }

And I added the same in function static int get_service_check_return_code(service *temp_service, check_result *queued_check_result).

Code: Select all

                /* if there was some error running the command, just skip it (this shouldn't be happening) */
                else if(queued_check_result->exited_ok == FALSE) {

                        logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning:  Check of service '%s' on host '%s' did not exit properly!\n", temp_service->description, temp_service->host_name);

                        my_free(temp_service->plugin_output);
                        my_free(temp_service->long_plugin_output);
                        my_free(temp_service->perf_data);

                        temp_service->plugin_output = (char *)strdup("(Service check did not exit properly.)");

                        rc = STATE_CRITICAL;
                        sleep(20);
                        }

On thursday without the change I had over 2400 host checks improperly exited, and over 10000 service checks, and almost 39000 iobroker registration failures. That's roughly the same as the preceding weeks. I changed checks.c on friday.
I had 0 host checks that exited improperly, and only 10 servicce checks that exited improperly on last sunday, and around 850 iobroker registration failures (saturday was slightly less). It's more like symptom control than anything else, but they resulted in 0 false alerts or flapping notifications.

What I'm going to do now is:
1) Double resources on the VM.
2) Undo my patch to get a fair reading of the amount of errors.
3) Let it run for a day and report back what I see.

Posted: **Mon Sep 11, 2017 4:42 pm**

alexpeeters wrote: 1) Double resources on the VM.
2) Undo my patch to get a fair reading of the amount of errors.
3) Let it run for a day and report back what I see.

Let us know how it goes!

Posted: **Tue Sep 12, 2017 2:19 am**

This is what happend the last 24 hours.
- Memory usage went down to around 10%
- CPU load as reported by VMWare hovers little under 20%
- CPU ready time hovers between 1200 and 1600 ms.

Counting the errors in the nagios logfiles results in:
- 3870 IO broker registration failures.
- 1489 checks that exited improperly.

I got about 300 notification emails about flapping, problem and recovering hosts and services.

I don't think the amount of virtual hardware is the restricting limit in this case. However it got me thinking, it could very well be some operating system limit. I didn't change any of those, they are just the way openSUSE set them up. Does anybody know what limits play a role in this case, and how to measure them?

Posted: **Tue Sep 12, 2017 9:27 am**

alexpeeters wrote:However it got me thinking, it could very well be some operating system limit.

I haven't yet seen the error messages in this thread normally associated with https://support.nagios.com/kb/article.php?id=139 but it might be worth a look.

You have a lot of errors, but how many hosts + services do you have in total?

Also, I notice you don't have a ramdisk (at least not in your nagios.cfg). https://assets.nagios.com/downloads/nag ... giosXI.pdf is written for XI, but should be useful in setting one up if you want.

Nagios Support Forum

Check did not exit properly / Failed to register iobroker

Re: Check did not exit properly / Failed to register iobroke

Re: Check did not exit properly / Failed to register iobroke

Re: Check did not exit properly / Failed to register iobroke

Re: Check did not exit properly / Failed to register iobroke

Re: Check did not exit properly / Failed to register iobroke

Re: Check did not exit properly / Failed to register iobroke

Re: Check did not exit properly / Failed to register iobroke

Re: Check did not exit properly / Failed to register iobroke

Re: Check did not exit properly / Failed to register iobroke

Re: Check did not exit properly / Failed to register iobroke