[Nagios-devel] nagiostats Bug with Active Service Checks
Posted: Mon Feb 23, 2009 10:47 pm
Hello,
During the course of a recent distributed deployment, I discovered a bug
in nagiostats (and possibly Nagios) that leads to misleading statistics
in certain situations.
In particular, I set things up so that every distributed server knew
about all of the service checks, but inherited several properties
(active_checks_enabled, notifications, etc) from a single configuration
file that was unique on each Nagios server. After initially loading up a
single monitoring host with a couple thousand service checks, I shuffled
them out to the other distributed hosts. This led to nagiostats
reporting wildly inflated active check latency for the initially
loaded host but realistic numbers for the other ones.
It appears that nagiostats uses check_type to determine whether to
process a service as though it is active, rather than
active_checks_enabled. This may well be fine if Nagios correctly reset
check_type after a configuration reload, but it doesn't appear to change
it.
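To make the distinction concrete, here is a minimal sketch in C. It assumes, as described above, that check_type reflects how the last result arrived while active_checks_enabled reflects the current configuration. The struct, the helper names, and the CHECK_TYPE_* constants are hypothetical stand-ins written for this illustration; they are not taken from the Nagios source.

```c
#include <stdlib.h>
#include <string.h>

/* Local stand-ins for this sketch; the real constants live in the Nagios headers. */
#define CHECK_TYPE_ACTIVE  0
#define CHECK_TYPE_PASSIVE 1

/* Hypothetical container for the two status.dat fields discussed above. */
struct service_flags {
    int check_type;            /* how the most recent result arrived */
    int active_checks_enabled; /* current configuration: 1 = enabled */
};

/* Apply one "var=val" pair to the flags; unknown variables are ignored. */
static void apply_pair(struct service_flags *f, const char *var, const char *val)
{
    if (!strcmp(var, "check_type"))
        f->check_type = atoi(val);
    else if (!strcmp(var, "active_checks_enabled"))
        f->active_checks_enabled = (atoi(val) > 0) ? 1 : 0;
}

/* Treat a service as active only when both fields agree. */
static int counts_as_active(const struct service_flags *f)
{
    return f->check_type == CHECK_TYPE_ACTIVE && f->active_checks_enabled;
}
```

A service whose entry still says check_type=0 but now has active_checks_enabled=0 is rejected by counts_as_active(), which is the case the old test let through.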
It looked like, as I changed services to active_checks_enabled = 0, the
active service latency average went higher and higher. Looking in
status.dat, the recently disabled services (which, by the by, still had
an active check scheduled when they were switched to
active_checks_enabled=0) would eventually time out and have a massive
latency, which would be averaged in with the rest of the latencies.
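A rough sketch of how one stale entry can dominate the average follows. It uses the same incremental-mean update that nagiostats applies to latency. The struct sample type, the mean_latency() helper, and the respect_enabled switch are made up for this illustration, and the numbers in the note below are invented, not measurements from my deployment.

```c
#include <stddef.h>

/* Hypothetical per-service sample for this sketch. */
struct sample {
    double latency;            /* seconds */
    int active_checks_enabled; /* 1 = enabled, 0 = disabled */
};

/* Incremental mean: avg_n = (avg_{n-1} * (n - 1) + x_n) / n.
 * When respect_enabled is nonzero, services with active checks
 * disabled are skipped instead of being averaged in. */
static double mean_latency(const struct sample *s, size_t count, int respect_enabled)
{
    double avg = 0.0;
    size_t n = 0;

    for (size_t i = 0; i < count; i++) {
        if (respect_enabled && !s[i].active_checks_enabled)
            continue;
        n++;
        avg = ((avg * ((double)n - 1.0)) + s[i].latency) / (double)n;
    }
    return avg;
}
```

With three enabled services at 0.25, 0.5 and 0.75 seconds plus one disabled service stuck at 3600 seconds, the unfiltered mean works out to 900.375 seconds, while the filtered mean is 0.5 seconds.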
This was specifically with Nagios 3.0.6; my apologies if this has been
fixed since the latest stable release.
The attached patch may be the correct answer, or it may be a workaround
for Nagios only setting check_type the first time a service is created
in status.dat. Either way, it was the quickest way for me to get more
accurate latency information, so I thought I'd share it along with the bug.
Feel free to let me know if there are any questions or if my diagnosis was
entirely wrong.
Thanks,
Tanner
--
Tanner Beck
The Linux Box
734.761.4689
Attachment: nagiostats-active-service-check-check.diff
diff -Nurp nagios-cvs/base/nagiostats.c nagios-cvs-modified/base/nagiostats.c
--- nagios-cvs/base/nagiostats.c 2008-12-20 12:17:23.000000000 -0500
+++ nagios-cvs-modified/base/nagiostats.c 2009-02-17 11:04:37.000000000 -0500
@@ -917,6 +917,7 @@ int read_status_file(void){
int downtime_depth=0;
time_t last_check=0L;
int should_be_scheduled=TRUE;
+ int active_checks_enabled=TRUE;
int has_been_checked=TRUE;
@@ -1082,7 +1083,7 @@ int read_status_file(void){
have_max_service_state_change=TRUE;
max_service_state_change=state_change;
}
- if(check_type==SERVICE_CHECK_ACTIVE){
+ if(check_type==SERVICE_CHECK_ACTIVE && active_checks_enabled==TRUE){
active_service_checks++;
average_active_service_latency=(((average_active_service_latency*((double)active_service_checks-1.0))+latency)/(double)active_service_checks);
if(have_min_active_service_latency==FALSE || min_active_service_latency>latency){
@@ -1193,6 +1194,7 @@ int read_status_file(void){
last_check=(time_t)0;
has_been_checked=FALSE;
should_be_scheduled=FALSE;
+ active_checks_enabled=FALSE;
}
@@ -1358,6 +1360,8 @@ int read_status_file(void){
has_been_checked=(atoi(val)>0)?TRUE:FALSE;
else if(!strcmp(var,"should_be_scheduled"))
should_be_scheduled=(atoi(val)>0)?TRUE:FALSE;
+ else if(!strcmp(var,"active_checks_enabled"))
+ active_checks_enabled=(atoi(val)>0)?TRUE:FALSE;
break;
default:
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]