
Re: Inconsistent NRDP performance

Posted: Wed Mar 14, 2018 2:42 pm
by cdienger
Sounds like realtime data isn't getting updated. This could be due to multiple Nagios processes or caching. Are there multiple Nagios processes running on any of the machines? "ps -ef | grep nagios.cfg" should show only two processes, similar to:

nagios 11232 1 3 12:27 ? 00:07:33 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11276 11232 0 12:27 ? 00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


Anything more should be killed. The caching option can be found under Admin > System Config > Performance Settings > Backend Cache.
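The process check above could also be scripted. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the daemons are started with nagios.cfg on the command line, as in the sample output.

```python
import subprocess
from typing import List


def nagios_daemon_lines(ps_output: str) -> List[str]:
    """Return the ps lines that mention nagios.cfg, ignoring the grep itself."""
    return [
        line
        for line in ps_output.splitlines()
        if "nagios.cfg" in line and "grep" not in line
    ]


def has_expected_process_count(ps_output: str, expected: int = 2) -> bool:
    """True when exactly `expected` Nagios processes are listed (two per the post)."""
    return len(nagios_daemon_lines(ps_output)) == expected


def current_ps_output() -> str:
    """Capture `ps -ef` from the local machine (assumes a Unix-like host)."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling has_expected_process_count(current_ps_output()) on each server should return True when only the parent and its worker are running.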

Re: Inconsistent NRDP performance

Posted: Wed Mar 14, 2018 3:23 pm
by krutaw
cdienger wrote: Sounds like realtime data isn't getting updated. This could be due to multiple Nagios processes or caching. Are there multiple Nagios processes running on any of the machines? "ps -ef | grep nagios.cfg" should show only two processes, similar to:

nagios 11232 1 3 12:27 ? 00:07:33 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11276 11232 0 12:27 ? 00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


Anything more should be killed. The caching option can be found under Admin > System Config > Performance Settings > Backend Cache.
I checked all 3 of the active nagios servers and there are indeed only 2 processes with nagios.cfg in the command line and the same was true of the passive server.

Also, I checked and the caching settings are not enabled in any of the servers so that's not what is driving this. Where else should I be looking at this point?

Re: Inconsistent NRDP performance

Posted: Wed Mar 14, 2018 3:56 pm
by krutaw
I think I found something, but I'm not sure what the heck to make of it. I happened to spot these lines in my nagios.log:

Code:

[1521057945] Warning: The results of service 'Datastore - Usage' on host 'some_host_name' are stale by 0d 0h 0m 32s (threshold=0d 0h 25m 0s).  I'm forcing an immediate check of the service.
What's interesting is that the Nagios passive server sees that I've set the freshness threshold to 25 minutes, but appears to consider the checks stale after mere seconds. And it's not just that one check: the log is littered with these warnings, some reporting staleness of as little as 1 second. As the log output shows, the freshness threshold is set, but it seems to be ignored. Thoughts?
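To look at all of these warnings at once rather than one by one, the log lines can be parsed. This is a rough sketch that assumes every warning follows exactly the format of the line quoted above; the function and key names are illustrative only.

```python
import re

# Matches durations such as "0d 0h 25m 0s".
_DURATION = r"(\d+)d (\d+)h (\d+)m (\d+)s"

STALE_RE = re.compile(
    r"\[(\d+)\] Warning: The results of service '([^']*)' on host '([^']*)' "
    r"are stale by " + _DURATION + r" \(threshold=" + _DURATION + r"\)"
)


def _to_seconds(days, hours, minutes, seconds):
    """Convert a d/h/m/s breakdown (as strings) into total seconds."""
    return ((int(days) * 24 + int(hours)) * 60 + int(minutes)) * 60 + int(seconds)


def parse_stale_warning(line):
    """Return the fields of a stale-result warning, or None if the line differs."""
    match = STALE_RE.search(line)
    if match is None:
        return None
    groups = match.groups()
    return {
        "logged_at": int(groups[0]),  # epoch timestamp in brackets
        "service": groups[1],
        "host": groups[2],
        "stale_seconds": _to_seconds(*groups[3:7]),
        "threshold_seconds": _to_seconds(*groups[7:11]),
    }
```

Running parse_stale_warning over each line of nagios.log and skipping the None results gives one record per warning, which makes it easier to see which hosts and services are affected.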

Oh, and in case it helps, I looked at the settings for one of the failing checks in objects.cache and it looks like this:

Code:

define service {
        host_name                       some_host_name
        service_description             Datastore - Usage
        check_period                    xi_timeperiod_24x7
        check_command                   check_dummy!2!"Data not received from $_HOSTNAGHOST$"!!!!!!
        contacts                        nagiosadmin
        notification_period             xi_timeperiod_24x7
        initial_state                   o
        importance                      0
        check_interval                  10.000000
        retry_interval                  1.000000
        max_check_attempts              5
        is_volatile                     0
        parallelize_check               1
        active_checks_enabled           1
        passive_checks_enabled          1
        obsess                          1
        event_handler_enabled           1
        low_flap_threshold              0.000000
        high_flap_threshold             0.000000
        flap_detection_enabled          1
        flap_detection_options          a
        freshness_threshold             1500
        check_freshness                 1
        notification_options            a
        notifications_enabled           0
        notification_interval           60.000000
        first_notification_delay        0.000000
        stalking_options                n
        process_perf_data               0
        retain_status_information       1
        retain_nonstatus_information    1
        }
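
For comparing the freshness settings across many services, a definition like the one above can be read into a dictionary. This is a minimal sketch that assumes one "directive value" pair per line, as in the objects.cache excerpt; parse_service_block is a hypothetical helper, not part of Nagios.

```python
def parse_service_block(text):
    """Parse one 'define service { ... }' block into a dict of directives."""
    directives = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and the opening/closing lines of the block.
        if not line or line.startswith("define") or line == "}":
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        directives[key] = value
    return directives


def freshness_minutes(directives):
    """Return the freshness threshold in minutes, or None if checking is off."""
    if directives.get("check_freshness") != "1":
        return None
    return int(directives["freshness_threshold"]) / 60
```

For the definition above, freshness_threshold is 1500 seconds, which is the 25 minutes shown in the log message.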

Re: Inconsistent NRDP performance

Posted: Thu Mar 15, 2018 11:32 am
by cdienger
I'm curious about the testing method with the log. The message would indicate that the check hasn't come in for threshold + stale value, or 25 minutes 32 seconds in this case. That would be in line with the behavior we've been seeing. On the other hand, if the check is going stale BEFORE the 25-minute threshold is reached, that would be a problem.
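
The arithmetic can be made explicit. The sketch below follows the reading described in this post (time since the last result = threshold + stale value); the function names are illustrative, and last_result_at is assumed to come from some independent record of when the passive result was actually received.

```python
def implied_last_result(logged_at, threshold_seconds, stale_seconds):
    """Epoch time of the last result implied by a stale warning."""
    return logged_at - threshold_seconds - stale_seconds


def went_stale_early(logged_at, last_result_at, threshold_seconds):
    """True if the warning was logged before the threshold had elapsed."""
    return (logged_at - last_result_at) < threshold_seconds


# Values from the quoted log line: timestamp 1521057945,
# threshold 25 minutes (1500 s), stale by 32 s.
last_seen = implied_last_result(1521057945, 1500, 32)
print(last_seen)  # 1521056413
```

If the sending side shows a result delivered later than that implied time, went_stale_early would return True and the threshold really is being ignored; otherwise the results are arriving late or not at all.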

If the above doesn't help you find the problem, I'd like to take a look at the systems in a remote session. In that case, please open a ticket at http://support.nagios.com/tickets/.