bug fix for avail.cgi
Posted: Mon Mar 07, 2016 6:09 am
We experienced a problem with avail.cgi on a 64-bit platform. The fix is provided in the attached avail.c.saved_stamp.patch file. It applies cleanly to Nagios 3.5.0 and Nagios 3.5.1, and it still applies (with some fuzz/offset) to Nagios 4.0.8 and Nagios 4.1.1. In other words, we found the bug in Nagios 3.5.1, but it is still present in Nagios 4.1.1.
There are two parts to the bug fix. One, saved_stamp is declared as int when it should really be a time_t. This won't show up as a practical effect until the year 2038, but it ought to be fixed anyway.
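To illustrate the first part, here is a minimal sketch of why an int is too narrow for a timestamp. The helper name stamp_fits_in_int is hypothetical and is not taken from avail.c; it assumes a platform where int is 32 bits and time_t is wider, which is typical of 64-bit Linux.

```c
#include <limits.h>
#include <time.h>

/* Hypothetical helper, not part of avail.c: report whether a timestamp
 * can be stored in an int without losing information. With a 32-bit int,
 * the largest representable value is 2147483647 seconds after the epoch,
 * which falls in January 2038. */
static int stamp_fits_in_int(time_t stamp)
{
    return stamp >= (time_t)INT_MIN && stamp <= (time_t)INT_MAX;
}
```

Under those assumptions a present-day timestamp passes the check, while any timestamp past INT_MAX does not, which is why declaring saved_stamp as time_t is the safer choice.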
The other part is more immediately serious. Within the compute_subject_downtime_times routine, saved_stamp is assigned a value derived from the incoming archived log data. The trouble is that in one branch of the code it is properly limited so it is never less than start_time, while in the branch immediately below that, the same correction is missing.
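The kind of limiting described here can be sketched as follows. This is not the actual code from avail.c or from the patch; the helper names clamp_to_start and interval_seconds are made up for illustration, and the sketch assumes the later timestamp is never earlier than start_time.

```c
#include <time.h>

/* Hypothetical helper: limit a timestamp taken from the archived log
 * so that it never precedes the start of the report period. */
static time_t clamp_to_start(time_t stamp, time_t start_time)
{
    return (stamp < start_time) ? start_time : stamp;
}

/* Hypothetical helper: seconds from a saved stamp to a later event,
 * counting only the part that lies inside the report period.
 * Assumes end_stamp >= start_time. */
static unsigned long interval_seconds(time_t saved_stamp, time_t end_stamp,
                                      time_t start_time)
{
    time_t from = clamp_to_start(saved_stamp, start_time);
    return (unsigned long)(end_stamp - from);
}
```

If one code path applies clamp_to_start and a parallel path does not, the second path can count time that lies before the report period, so the two totals are no longer measured over the same window.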
We found this problem by running into it at a customer site, using rather large archived data files. Thus I don't have an exact characterization of what input data triggers the fault, nor any simple test data I can post here. But a look at the affected code will show the obvious parallelism and the obvious applicability of the patch.
The result of the unfixed code is that some time intervals are restricted to the time interval specified on the command line, while others are not. Then, when it comes to computing certain values such as TIME_OK_UNSCHEDULED, which is calculated on the fly as temp_subject->time_ok - temp_subject->scheduled_time_ok, the subtraction can come out negative. The result is then output using %lu (reflecting the use of unsigned 64-bit integers), so it appears as a huge positive value such as 18446744073709465216 (which is equal to 2^64 - 86400). This mistake can cause a lot of downstream confusion, depending on what is done with the output of the avail.cgi program.
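The wraparound can be reproduced in isolation. The sketch below is not code from avail.c; the function name unscheduled_time is hypothetical, and the specific value quoted above assumes a platform where unsigned long is 64 bits.

```c
#include <stdio.h>

/* Hypothetical stand-in for the on-the-fly calculation: total time in a
 * state minus the scheduled-downtime portion of that time. Unsigned
 * arithmetic wraps modulo 2^N when scheduled exceeds total, instead of
 * producing a negative number. */
static unsigned long unscheduled_time(unsigned long total,
                                      unsigned long scheduled)
{
    return total - scheduled;
}

/* Hypothetical helper: print the value the same way the post describes,
 * using the %lu conversion. */
static void print_unscheduled(unsigned long total, unsigned long scheduled)
{
    printf("%lu\n", unscheduled_time(total, scheduled));
}
```

Calling print_unscheduled(0, 86400) on such a platform prints 18446744073709465216, matching the value seen in the avail.cgi output when the scheduled time exceeds the total by one day.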