HOST retry_interval is being disregarded

op-team · Post by **op-team** » Tue Jan 30, 2018 10:33 am

Hi Guys,

We are running Nagios XI 5.4.4 on a CentOS 6.9 server.

I would like to bring to your attention some strange behaviour we have noticed.
The time between host check attempts doesn't match the configured retry_interval.

Please have a look at the retry timing in the following example:

[Tue Jan 30 08:51:48 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;1;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 08:52:14 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;2;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 08:53:12 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;3;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 08:55:10 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;HARD;4;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 15:54:07 2018] HOST ALERT: 1287060420_00_oneaccess-01;UP;HARD;4;TEST retry_interval
[Tue Jan 30 15:55:50 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;1;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 15:56:15 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;2;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 15:57:35 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;3;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 15:59:36 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;HARD;4;CRITICAL - X.X.X.X: rta nan, lost 100%

Meanwhile, the host configuration sets "retry_interval" to 3 minutes:
define host {
    name generic-host
    max_check_attempts 4
    check_interval 5
    retry_interval 3
    active_checks_enabled 1
    passive_checks_enabled 1
    check_period 24x7
    check_freshness 1
    event_handler_enabled 1
    flap_detection_enabled 1
    flap_detection_options o,u,
    process_perf_data 1
    retain_status_information 1
    retain_nonstatus_information 1
    contact_groups MGMT-PBX
    notification_interval 0
    notification_period 24x7
    first_notification_delay 0
    notification_options d,
    notifications_enabled 1
    register 0
}
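
To put numbers on the mismatch, the gaps between the logged attempts can be computed and compared with the configured retry interval. The following is a minimal Python sketch, not part of Nagios. It assumes `interval_length` is at its default of 60 seconds, so that `retry_interval 3` means 180 seconds, and the helper name `attempt_gaps` is made up for illustration.

```python
import re
from datetime import datetime

# HOST ALERT lines copied from the log excerpt above.
LOG = """\
[Tue Jan 30 08:51:48 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;1;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 08:52:14 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;2;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 08:53:12 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;3;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 08:55:10 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;HARD;4;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 15:54:07 2018] HOST ALERT: 1287060420_00_oneaccess-01;UP;HARD;4;TEST retry_interval
[Tue Jan 30 15:55:50 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;1;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 15:56:15 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;2;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 15:57:35 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;SOFT;3;CRITICAL - X.X.X.X: rta nan, lost 100%
[Tue Jan 30 15:59:36 2018] HOST ALERT: 1287060420_00_oneaccess-01;DOWN;HARD;4;CRITICAL - X.X.X.X: rta nan, lost 100%
"""

# Captures: timestamp, host state (UP/DOWN), attempt number.
ALERT = re.compile(r"^\[(.+?)\] HOST ALERT: [^;]+;(\w+);\w+;(\d+);")

# Assumption: interval_length is the default 60 s, so retry_interval 3 = 180 s.
EXPECTED_RETRY_SECONDS = 3 * 60


def attempt_gaps(log_text):
    """Return (attempt, seconds since previous entry) for DOWN retries (attempt > 1)."""
    gaps = []
    previous = None
    for line in log_text.splitlines():
        match = ALERT.match(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        state, attempt = match.group(2), int(match.group(3))
        if previous is not None and state == "DOWN" and attempt > 1:
            gaps.append((attempt, int((when - previous).total_seconds())))
        previous = when
    return gaps


for attempt, seconds in attempt_gaps(LOG):
    print(f"attempt {attempt}: {seconds} s after the previous one "
          f"(configured: {EXPECTED_RETRY_SECONDS} s)")
```

For the log above this gives gaps of 26, 58 and 118 seconds in the first sequence and 25, 80 and 121 seconds in the second, all well short of 180 seconds.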

Thanks in advance for your quick reply

Best regards

dwhitfield · Post by **dwhitfield** » Tue Jan 30, 2018 1:36 pm

I can't really tell much from that config with the host name blocked out. Can you PM me your Profile? If you'd prefer, you can open a ticket at https://support.nagios.com/tickets/

If you can PM it, you can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). The profile gives us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.

You can also generate a profile manually using the script at /usr/local/nagiosxi/html/includes/components/profile/getprofile.sh

That should generate a profile in /usr/local/nagiosxi/var/components/ which you can get off the server with an application such as FileZilla.

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.

If you get an error that PROFILE BUILD FAILED, please see https://support.nagios.com/kb/article.p ... ategory=44

It's also possible you have multiple nagios parent processes. Please run through the following in order and let me know if you run into any issues.

NOTE: You ***must*** use mariadb instead of mysqld in the commands below, ***if*** you have mariadb.
# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop
# killall -9 nagios
# killall -9 ndo2db
# rm -f /usr/local/nagios/var/rw/nagios.cmd
# rm -f /usr/local/nagios/var/nagios.lock
# rm -f /usr/local/nagios/var/ndo.sock
# rm -f /usr/local/nagios/var/ndo2db.lock
# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
# for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
# service mysqld start
# service ndo2db start
# service nagios start
# service httpd start
# service crond start
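
One way to check whether more than one Nagios parent process is actually present, before or after the sequence above, is to look for several top-level `nagios` processes. The sketch below is only an illustration under stated assumptions, not an official Nagios tool: it assumes Python 3.7+, that the daemon binary is named `nagios`, and that a daemonized parent has PPID 1 while its workers are children of that parent. The function names and the sample process list are hypothetical.

```python
import os
import subprocess


def nagios_parents(ps_output):
    """Return PIDs of processes that look like top-level Nagios daemons.

    Expects text in the layout of `ps -eo pid,ppid,args`.
    """
    parents = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, ppid, args = fields
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # skips the header row
        binary = os.path.basename(args.split()[0])
        # Assumption: a daemonized parent is reparented to PID 1.
        if binary == "nagios" and ppid == "1":
            parents.append(int(pid))
    return parents


def live_ps_output():
    """List every process with its PID, parent PID and command line."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Made-up sample (not taken from this thread) with two parent daemons.
SAMPLE = """\
  PID  PPID COMMAND
 2101     1 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
 2103  2101 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
 2110  2101 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
 3977     1 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
"""

print(nagios_parents(SAMPLE))
```

On the server itself, `nagios_parents(live_ps_output())` would be the call to make. More than one PID in the result would suggest duplicate daemons, which is the situation the stop/cleanup/start sequence is meant to clear.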

op-team · Post by **op-team** » Wed Jan 31, 2018 5:19 am

Hi,

Here is my profile file:

profile.zip

I ran the commands you suggested, but they did not solve the problem.
[root@nagios: ~]# ipcs -q

------ Message Queues --------
key msqid owner perms used-bytes messages
0x7e000002 5341184 nagios 600 0 0

Look at the "Next Check:" time in the capture below:

Capture1.PNG

One minute later, the next check time had changed, as you can see below:

Capture2.PNG

Best regards

dwhitfield · Post by **dwhitfield** » Wed Jan 31, 2018 12:28 pm

I would really have anticipated it being less than 3 minutes, but it's never going to be exact. I would suggest installing a ramdisk via the instructions at https://assets.nagios.com/downloads/nag ... giosXI.pdf. If getting the load down doesn't resolve the issue, then I think you may just have to wait for a new version of Core and the performance improvements coming in XI 5.5. At the moment, we are not planning a 5.4.13. If we hear of more verifiable issues, that could change, but in this case you don't appear to be missing notifications, so I wouldn't say this is a major issue.

Another thing to try would be to run through https://assets.nagios.com/downloads/nag ... tabase.pdf but given the load on the server, I suspect any database fix would be temporary. However, running a repair and installing the ramdisk may be enough.

Nagios Support Forum
