Re: livestatus not picking up comments after nagios restart
Posted: Wed Feb 27, 2013 4:15 pm
by GavinG
scottwilkerson wrote: I'm going to have our Core developer take a look at this....
Cheers. I've been doing a fair bit of thinking about this in the last few days, and have tried a patch which doesn't rely on next_comment_id and instead sets this number to the highest comment_id found in the retention.dat file + 1 (http://marc.info/?l=nagios-devel&m=121251227708074). I think that would work but it's not an elegant solution. If you push a lot of comments through your system then the ~4.29 billion unsigned long int comment_id values could potentially be used up (unlikely, but possible).
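For illustration, the "highest comment_id + 1" idea could be sketched in C roughly as below. This is not the patch from the linked post and not Nagios code: the function name next_id_from_retention is made up, and it assumes retention.dat stores each comment's ID on its own line as comment_id=<number>.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Nagios.
 * Scans retention-style text for lines of the form "comment_id=<n>"
 * (an assumed layout) and returns the highest value found plus 1.
 * Returns 1 when no such line exists. */
static unsigned long next_id_from_retention(const char *text) {
    static const char key[] = "comment_id=";
    const size_t key_len = sizeof(key) - 1;
    unsigned long highest = 0;
    const char *line = text;

    assert(text != NULL);

    while (*line != '\0') {
        const char *p = line;
        const char *eol;

        /* Skip leading indentation, then compare only the start of the
         * line, so a "next_comment_id=" line is never counted. */
        while (*p == ' ' || *p == '\t')
            p++;
        if (strncmp(p, key, key_len) == 0) {
            unsigned long id = strtoul(p + key_len, NULL, 10);
            if (id > highest)
                highest = id;
        }

        /* Advance to the next line, or stop at the end of the text. */
        eol = strchr(line, '\n');
        if (eol == NULL)
            break;
        line = eol + 1;
    }
    return highest + 1; /* wraps to 0 if highest is already ULONG_MAX */
}
```

The stored next_comment_id value is ignored on purpose here, so a counter that has been reset cannot drag new IDs back into the range already in use.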
Looking at the nagios-3.4.4/common/comments.c file early this morning (seriously, why do I think best at 1am) I found this bit of code:
Code: Select all
comment *find_comment(unsigned long comment_id, int comment_type) {
	comment *temp_comment = NULL;

	for(temp_comment = comment_list; temp_comment != NULL; temp_comment = temp_comment->next) {
		if(temp_comment->comment_id == comment_id && temp_comment->comment_type == comment_type)
			return temp_comment;
		}

	return NULL;
	}
As I read it, this is what gets used to look for the next unused host or service comment_id, and it matches on the comment_type (1 for host, 2 for service) as well as the comment_id. This allows for duplicate comment_ids globally while keeping them unique within the host and service comment lists. I don't necessarily see this as a bad idea, even though adding a comment to a host or service always seems to increase the next_comment_id value by 1.
I've thought that, to maintain global uniqueness, changing the above code to this may be a better, more elegant solution for my immediate problem:
Code: Select all
comment *find_comment(unsigned long comment_id, int comment_type) {
	comment *temp_comment = NULL;

	for(temp_comment = comment_list; temp_comment != NULL; temp_comment = temp_comment->next) {
		if(temp_comment->comment_id == comment_id)
			return temp_comment;
		}

	return NULL;
	}
If my limited C/C++ understanding is right (I can follow/read code .. mostly), then this will stop it looking for a unique comment_id within the hostcomments or servicecomments lists and instead do it globally over all comments. If a comment exists with that comment_id anywhere in the system then it will keep searching.
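To see the difference between the two versions in isolation, here is a small self-contained C sketch. The demo_comment struct and the demo_ function names are cut-down stand-ins invented for this example, not the real Nagios definitions; only the two if conditions are taken from the snippets above.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the Nagios comment struct (illustration only). */
enum { DEMO_HOST_COMMENT = 1, DEMO_SERVICE_COMMENT = 2 };

typedef struct demo_comment {
    unsigned long comment_id;
    int comment_type;
    struct demo_comment *next;
} demo_comment;

/* Same condition as the original find_comment(): ID and type must both match. */
static demo_comment *demo_find_scoped(demo_comment *list,
                                      unsigned long comment_id,
                                      int comment_type) {
    demo_comment *c;

    assert(comment_type == DEMO_HOST_COMMENT ||
           comment_type == DEMO_SERVICE_COMMENT);
    for (c = list; c != NULL; c = c->next) {
        if (c->comment_id == comment_id && c->comment_type == comment_type)
            return c;
    }
    return NULL;
}

/* Same condition as the patched version: the ID alone decides. */
static demo_comment *demo_find_global(demo_comment *list,
                                      unsigned long comment_id) {
    demo_comment *c;

    for (c = list; c != NULL; c = c->next) {
        if (c->comment_id == comment_id)
            return c;
    }
    return NULL;
}
```

With a single host comment holding ID 5 in the list, demo_find_scoped(list, 5, DEMO_SERVICE_COMMENT) returns NULL, so ID 5 still looks free for a service comment, while demo_find_global(list, 5) returns the host comment and the ID counts as taken.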
In the long term I am now happy that reusing unused comment_ids is actually a good thing and that this is probably a Livestatus bug: it doesn't handle duplicate comment_ids and instead treats them as a globally unique identifier. I'll keep following up with them, and use the update I posted here in the meantime assuming I don't find any problems with it (I maintain our own internal Nagios packages so it's not a big deal to implement).
Re: livestatus not picking up comments after nagios restart
Posted: Thu Feb 28, 2013 5:31 pm
by abrist
If our C developer is in tomorrow, I will bring this post to his attention.
Re: livestatus not picking up comments after nagios restart
Posted: Tue Mar 05, 2013 2:51 pm
by GavinG
abrist wrote: If our C developer is in tomorrow, I will bring this post to his attention.
Sorry, I don't mean to sound like I'm hassling, just wondering if there was any news.
We were lucky and got to try our patched version in production (had to fail over from prod to DR due to a network fault) and my small patch didn't seem to cause any issues, so we're going to roll it to production early next week.

Re: livestatus not picking up comments after nagios restart
Posted: Tue Mar 05, 2013 5:22 pm
by abrist
He was made aware of the issue, so I assume he will look at it if he has not already. He does pay very close attention to the core bug tracker:
http://tracker.nagios.org
As you have a patch and identified the issue, posting on the tracker and linking back to this thread should get the comment issue some attention.
Re: livestatus not picking up comments after nagios restart
Posted: Tue Mar 05, 2013 6:35 pm
by GavinG
Ah ok, cheers. My patch is only a temporary, hacky workaround for Livestatus on my end rather than something I consider needs fixing in Nagios, which is why I don't really want to submit a bug report. I'm about 99.99% sure that Nagios isn't doing anything wrong by keeping comment_id unique within the host or service comment lists rather than treating it as a globally unique identifier.
I guess all I'm looking for is confirmation that "Yes, this is by design. If product X doesn't handle it then it's a fault with product X" which will let me go back to the Livestatus guys with proof that they can't treat comment_id's as a globally unique identifier.

Re: livestatus not picking up comments after nagios restart
Posted: Wed Mar 06, 2013 12:18 pm
by abrist
As far as my understanding of the issue goes, and from my discussion with estanley last Friday, this is by design in Nagios. The comment_id was never meant to be global, but only unique within the scope of services or hosts. So it may be high time to contact the mklivestatus community for clarification and a possible bug report. Feel free to link to this thread, and please post the link to your bug report/discussion from mk's community here in this thread for future forum-goers.
Good sleuthing, good sir, and cheers.
Re: livestatus not picking up comments after nagios restart
Posted: Wed Mar 06, 2013 3:22 pm
by GavinG
Brilliant. Thanks muchly.
I suspect that as Mathias is at CeBIT he probably hasn't had time to look at my emailed bug report yet (the mailing list, at least the English version, looked more like a user support list than an interact-with-the-devs list so wasn't much help at all). I'll keep following it up with them and hope it gets picked up and fixed for the new 1.2.2 release (which is apparently very close).
Thanks again.
Re: livestatus not picking up comments after nagios restart
Posted: Wed Mar 06, 2013 3:53 pm
by abrist
You are most welcome. We will stay tuned.
Re: livestatus not picking up comments after nagios restart
Posted: Fri Mar 08, 2013 8:25 am
by scottwilkerson
Our Core developer had a chance to look at this. His comments are as follows:
estanley wrote: There is definitely only one comment ID space, so a host and a service should never share a comment ID. Other than a rollover of the comment IDs at 2^32, there is only one case I can think of where you might get duplicate comment IDs and that is if core is restarted in the absence of the retention.dat file or the next_comment_id was not in the retention.dat file.
The retention data is read before the xcddefault_initialize_comment_data() function is called. Only in the case where the next_comment_id was not set when the retention data was read would xcddefault_initialize_comment_data() do anything. As for the unused main_config_file parameter, I can only guess that maybe this function was at one time intended to be called through a function pointer and it is passed because there are (or were going to be) other initialization functions that would need the main configuration.
In the rest of the code, the next_comment_id is only ever incremented and it is specifically incremented after adding both host and service comments.
One caveat here, these conclusions come from reading the code, not executing it. The code is pretty straightforward so I don't think I've missed anything.
One possible solution to the poster's issue might be to shut down core, manually edit the retention.dat file to set the next_comment_id above his highest known comment ID, and restart. I would call that an unsupported solution, but it's worth a shot.
Hope that helps.
Re: livestatus not picking up comments after nagios restart
Posted: Fri Mar 08, 2013 4:52 pm
by GavinG
Hmmmm ... more to think about.
Just to respond to the "reset the next_comment_id" suggestion: we have, and after a restart the status.dat file shows that number, but some process still resets it. More on this later in the post.
It sounds like there are possibly a few things that could be the issue here then.
- Option 1: next_comment_id being reset back to 1. This is happening for us - we haven't been able to figure out why yet.
- Option 2: *find_comment from common/comments.c is finding the next available comment_id to use only within the scope of hostcomments or servicecomments. I think this is what happens right now but could be compromised due to #1.
- Option 3: *find_comment from common/comments.c should be finding the next available comment_id to use globally, across both hostcomments and servicecomments. I think this is what should be happening according to estanley's reply but could be compromised due to #1.
My random thoughts as of 10:51a with only 1 coffee in me:
- Option 1: Suggests a Nagios issue to me. If this is the issue then there is something resetting next_comment_id back to 1 (or a low number). We have Nagios saving out the retention.dat file and we don't lose anything (host/service statuses) through restarts so I don't think we have any issues there. We've checked through restarts and the next_comment_id isn't changing - although it may be low due to whatever has happened to cause it to reset.
Things that it could be:
- Could it be related to the nightly rotating of log files? I'll try to log into our problem server tonight, stop Nagios, reset next_comment_id to the right number in the retention.dat file, start Nagios, then watch what happens after midnight. I need to do this in the next 2 nights as we're installing my patched 3.4.4 (which removes the comment_type restriction in common/comments.c: *find_comment as posted in http://support.nagios.com/forum/posting ... e3#pr46686) on Monday.
- Could it be related to what is happening in option 2? That is, we're finding a low comment_id for the new comment we're adding, which in turn is setting next_comment_id back to a low number?
- Option 2: Suggests a Livestatus issue to me. If this is the issue (duplicate comment_id's are OK only within the scope of hostcomments or servicecomments, not globally) then reusing comment_id's is valid and OK. As earlier, next_comment_id being reset is possibly compromising this and once we fix that this may not be an issue anymore.
- Option 3: Suggests a Nagios issue to me. If this is the issue (duplicate comment_id's are not OK and must be unique globally within the scope of hostcomments and servicecomments) then reusing comment_id's is bad and my patch is probably the right fix. As earlier, next_comment_id being reset is possibly compromising this and once we fix that this may not be an issue anymore.
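To make the difference between Options 2 and 3 concrete, here is a small C sketch of what happens once the counter has fallen back to 1 while retained comments still hold low IDs. It assumes a new ID is picked by starting at the counter and stepping forward until the in-use check passes; that loop, the demo_ names and the array layout are all made up for illustration and are not Nagios code.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_HOST = 1, DEMO_SERVICE = 2 };

/* Minimal stand-in for a retained comment (illustration only). */
typedef struct {
    unsigned long id;
    int type;
} demo_entry;

/* Returns 1 if some entry already uses this id. When scoped is nonzero
 * the entry must also have the same type (Option 2); otherwise any
 * entry with the id counts (Option 3). */
static int demo_in_use(const demo_entry *entries, size_t count,
                       unsigned long id, int type, int scoped) {
    size_t i;

    for (i = 0; i < count; i++) {
        if (entries[i].id == id && (!scoped || entries[i].type == type))
            return 1;
    }
    return 0;
}

/* Assumed allocation loop: start at the counter and step forward
 * until the in-use check reports a free id. */
static unsigned long demo_allocate(const demo_entry *entries, size_t count,
                                   unsigned long counter, int type,
                                   int scoped) {
    assert(counter != 0);
    while (demo_in_use(entries, count, counter, type, scoped))
        counter++;
    return counter;
}
```

With two retained host comments holding IDs 1 and 2 and the counter reset to 1, a new service comment gets ID 1 under the scoped check (a duplicate of the host comment's ID) but ID 3 under the global check.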
Maybe we need to ignore next_comment_id from the retention.dat file and instead calculate it based on highest comment_id+1 (like http://marc.info/?l=nagios-devel&m=121251227708074 except they don't do the +1).
I'll do the reset tonight and post results tomorrow (I'm in NZ so tomorrow is still a fair way off).
[edit] I did the next_comment_id update just now, but in the process I may have found what our problem was .. and am going to kick myself if this is what's causing it. I won't know until tomorrow though. If it is this particular thing it does suggest a possible Nagios bug elsewhere in the code so watch this space.
Really appreciate the responses and time to help us.