Bugs in (and fixes for) passive check logging

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Locked
eponymousalias
Posts: 17
Joined: Mon Mar 07, 2016 5:38 am

Bugs in (and fixes for) passive check logging

Post by eponymousalias »

Back in March, I reported in these pages a bug in the way that passive service check data is written to the nagios.log file:

Nagios logs bad service-check data
https://support.nagios.com/forum/viewto ... =7&t=61914

For reasons I do not understand, that posting is now locked, and it appears that no action has been taken on it.

At the time, all I had in hand was the symptom, not the cause. Recently, I ran into the same problem again, and found myself sufficiently annoyed that I dug into the code to see what was happening.

First of all, the same problem also affects logging of passive host checks. So both of these types of messages show incorrect data, namely the state and plugin output that were in place before the new passive-check results were processed, not the state and plugin output of the passive check which is purportedly being logged at this time.

Code: Select all

    PASSIVE HOST CHECK
    PASSIVE SERVICE CHECK
Second, a look at the cause in the code, together with a little historical analysis, shows that the problems appeared due to refectoring of base/checks.c in the Nagios 4.4.0 release. Hence the bugs appear in all releases between 4.4.0 and at least 4.4.6, inclusive. To enable you to follow along precisely, I am giving line numbers in the following in terms of what is visible in the unmodified Nagios 4.3.4 and 4.4.6 releases.

In Nagios 4.3.4, we had these blocks of code:
  • Within process_host_check_result():

    Code: Select all

       2489     /* log passive checks - we need to do this here, as some my bypass external commands by getting dropped in checkresults dir */
       2490     if(hst->check_type == CHECK_TYPE_PASSIVE) {
       2491         if(log_passive_checks == TRUE)
       2492             logit(NSLOG_PASSIVE_CHECK, FALSE, "PASSIVE HOST CHECK: %s;%d;%s\n", hst->name, new_state, hst->plugin_output);
       2493         }
    
  • Within handle_async_service_check_result():

    Code: Select all

        529     /* log passive checks - we need to do this here, as some my bypass external commands by getting dropped in checkresults dir */
        530     if(temp_service->check_type == CHECK_TYPE_PASSIVE) {
        531         if(log_passive_checks == TRUE)
        532             logit(NSLOG_PASSIVE_CHECK, FALSE, "PASSIVE SERVICE CHECK: %s;%s;%d;%s\n", temp_service->host_name,
                            temp_service->description, temp_service->current_state, temp_service->plugin_output);
        533         }
    
In both of those blocks, we have state and plugin output data being logged, and it represents the data from the new passive check result being processed.
  • new_state was passed in to process_host_check_result(), and left unmodified since the start of that call. This value was produced by a call to get_host_check_return_code() within the handle_async_host_check_result() routine before process_host_check_result() was called.
  • hst->plugin_output is also unchanged within process_host_check_result(), having been updated to the new plugin output by the call to parse_check_output() from within the handle_async_host_check_result() routine before that routine called process_host_check_result(). (That value may be overridden in the middle by the call to get_host_check_return_code(), before process_host_check_result() is called.)
  • temp_service->current_state was produced by a call to get_service_check_return_code() from within the handle_async_service_check_result() routine, just a bit before the PASSIVE SERVICE CHECK logging within that same routine.
  • temp_service->plugin_output was updated to the new plugin output by the call to parse_check_output() from within the handle_async_service_check_result() routine, just a bit before the PASSIVE SERVICE CHECK logging within that same routine. (That value may be overridden in the middle by the call to get_service_check_return_code(), before the logging occurs.)
In Nagios 4.4.6, the sequence of field updates and references is shuffled around relative to what happened in Nagios 4.3.4.
  • Within host_is_passive():

    Code: Select all

        661     /* log passive checks - we need to do this here as some may bypass external commands by getting dropped in checkresults dir */
        662     /* todo - check if current_state is right - i don't think it is! */
        663     if (log_passive_checks == TRUE) {
        664         logit(NSLOG_PASSIVE_CHECK, FALSE, "PASSIVE HOST CHECK: %s;%d;%s\n",
        665             hst->name,
        666             hst->current_state,
        667             hst->plugin_output);
        668     }
    
    That comment on line 662 turns out to be prescient.
  • Within service_is_passive():

    Code: Select all

        633     /* log passive checks - we need to do this here, as some may bypass external commands by getting dropped in checkresults dir */
        634     if (log_passive_checks == TRUE) {
        635         logit(NSLOG_PASSIVE_CHECK, FALSE, "PASSIVE SERVICE CHECK: %s;%s;%d;%s\n",
        636             svc->host_name,
        637             svc->description,
        638             svc->current_state,
        639             svc->plugin_output);
        640     }
    
In both of those blocks, we have state and plugin output data being logged, and it represents old existing data before the new passive check result is applied. Which makes these particular log messages hugely confusing both for diagnostic purposes, and if you have any code that uses such log messages to trigger some other actions.
  • The call to host_is_passive() within handle_async_host_check_result() is made before any of the check-result data is applied, so:
    • hst->current_state represents the existing state, which has not yet been updated as it will be within the call to host_initial_handling() that has not yet happened. That routine will set it from analysis within the get_host_check_return_code() routine.
    • hst->plugin_output represents the existing plugin output, which has not yet been updated as it will be within the call to host_initial_handling() that has not yet happened. That routine will set it by calling parse_check_output() to perform most of the detail work, with some cleanup afterward.
  • The call to service_is_passive() within handle_async_service_check_result() is made before any of the check-result data is applied, so:
    • svc->current_state represents the existing state, which has not yet been updated as it will be within the call to service_initial_handling() that has not yet happened. That routine will set it from analysis within the get_service_check_return_code() routine.
    • svc->plugin_output represents the existing plugin output, which has not yet been updated as it will be within the call to service_initial_handling() that has not yet happened. That routine will set it by calling parse_check_output() to perform most of the detail work, with some cleanup afterward.
The fixes for the bad logging are to move around a few lines of code, so the logged values once again properly reflect the new incoming data instead of the old data which is being replaced. Specifically:
  • Move lines 661-668 to immediately after the call to host_initial_handling() on line 2268 (in the untouched Nagios 4.4.6 code), deleting the "i don't think it is!" comment in the process, and enclosing the logging in an additional conditional that reflects the necessary constraint for this output:

    Code: Select all

        host_initial_handling(hst, cr, &old_plugin_output);
    
        /* log passive checks - we need to do this here, as some may bypass external commands by getting dropped in checkresults dir */
        if (hst->check_type == CHECK_TYPE_PASSIVE) {
            if (log_passive_checks == TRUE) {
                logit(NSLOG_PASSIVE_CHECK, FALSE, "PASSIVE HOST CHECK: %s;%d;%s\n",
                    hst->name,
                    hst->current_state,
                    hst->plugin_output);
            }
        }
    
  • Move lines 633-640 to immediately after the call to service_initial_handling() on line 1237 (in the untouched Nagios 4.4.6 code), enclosing the logging in an additional conditional that reflects the necessary constraint for this output:

    Code: Select all

        service_initial_handling(svc, cr, &old_plugin_output);
    
        /* log passive checks - we need to do this here, as some may bypass external commands by getting dropped in checkresults dir */
        if (svc->check_type == CHECK_TYPE_PASSIVE) {
            if (log_passive_checks == TRUE) {
                logit(NSLOG_PASSIVE_CHECK, FALSE, "PASSIVE SERVICE CHECK: %s;%s;%d;%s\n",
                    svc->host_name,
                    svc->description,
                    svc->current_state,
                    svc->plugin_output);
            }
        }
    
Locked