
Retry Interval behaviour after upgrade to 4.4.1

Posted: Wed Aug 01, 2018 10:54 pm
by tonymcg27
Hi Guys,

Most of my services are set up to run every 10 minutes, with a retry interval of 1 minute.
But after upgrading from 4.3.4 to 4.4.1, when a service detects the first soft state change, the next check is scheduled 10 minutes later, not 1 minute later.
I didn't make any changes to my nagios.cfg file during the upgrade, so maybe it needs tweaking, but I didn't see anything in the release notes that suggested that a change was necessary.
I'm just wondering if anyone else has noticed this behaviour? Is there a bug in 4.4.1, or in my cfg file(s)?

Cheers from Down Under,
Tony
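
To make the expected behaviour concrete, here is a minimal sketch of the arithmetic described in this post. It is a simplified model, not Nagios Core's scheduling code: the function name `expected_next_check` and its arguments are made up for illustration, and the 60-second default assumes the usual `interval_length` setting in nagios.cfg.

```shell
# Simplified model of the scheduling described above; NOT Nagios Core's code.
# Args: last_check_epoch state_type check_interval retry_interval [interval_length]
expected_next_check() {
    last_check=$1
    state_type=$2              # SOFT (problem still being retried) or HARD
    check_interval=$3          # in interval units (10 in the post above)
    retry_interval=$4          # in interval units (1 in the post above)
    interval_length=${5:-60}   # seconds per unit; assumed default of 60

    if [ "$state_type" = "SOFT" ]; then
        # A soft problem state should be rechecked after the retry interval
        echo $(( last_check + retry_interval * interval_length ))
    else
        # Otherwise the normal check interval applies
        echo $(( last_check + check_interval * interval_length ))
    fi
}
```

With the values from the post, a check that goes soft at epoch 1000 would be expected again at 1060, whereas the behaviour reported on 4.4.1 corresponds to 1600.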

Re: Retry Interval behaviour after upgrade to 4.4.1

Posted: Thu Aug 02, 2018 7:24 am
by scottwilkerson
This is a known bug, and it is fixed in the maint branch.

I believe I found the cause in Core, and the fix is in the maint branch on GitHub:
https://github.com/NagiosEnterprises/na ... ee/maint

Re: Retry Interval behaviour after upgrade to 4.4.1

Posted: Thu Aug 02, 2018 7:38 pm
by tonymcg27
Thanks Scott, for your quick response and resolution. I have installed the "maint" release from GitHub and the check intervals are fine now.
But I've spotted another issue: most of the "SOFT;2" records seem to be missing from the nagios.log file. The only time I see a SOFT;2 record is when the state of the service changes, e.g. if the SOFT;1 record is a WARNING and the SOFT;2 record is a CRITICAL. It's not a big issue, so no hurry. Is that a known issue too?
Thanks again.
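
One way to confirm the missing records is to count the alert lines per state type and attempt. The sketch below assumes the usual `SERVICE ALERT: host;service;STATE;TYPE;ATTEMPT;output` layout of nagios.log lines; the function name `count_alert_attempts` is hypothetical.

```shell
# Summarise SERVICE ALERT entries by state type and attempt number.
# Assumed line format:
#   [epoch] SERVICE ALERT: host;service;STATE;TYPE;ATTEMPT;output
# Args: path to a nagios.log file
count_alert_attempts() {
    # Split on ';' so that field 4 is the state type and field 5 the attempt
    awk -F';' '/SERVICE ALERT:/ { count[$4 ";" $5]++ }
               END { for (k in count) print k, count[k] }' "$1" | sort
}

# Usage (log path is an assumption based on the var directory in this thread):
# count_alert_attempts /usr/local/nagios/var/nagios.log
```

If there are many `SOFT;1` lines but almost no `SOFT;2` lines, that would match the symptom described here.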

Re: Retry Interval behaviour after upgrade to 4.4.1

Posted: Thu Aug 02, 2018 10:31 pm
by scottwilkerson
There was one more commit tonight to the maint branch that I believe fixes this as well

Re: Retry Interval behaviour after upgrade to 4.4.1

Posted: Sun Aug 05, 2018 7:10 pm
by tonymcg27
Hi Scott, thanks again for the quick reply. I have installed the latest "maint" code but it doesn't seem to fix the logging issue. And I've also noticed that when the checks recover I am not getting a Recovery Notification.

Re: Retry Interval behaviour after upgrade to 4.4.1

Posted: Mon Aug 06, 2018 7:15 am
by scottwilkerson
I should have noted that the services that were stuck in the soft state will need to go into an OK state before they act normally. This can happen naturally, or by sending an OK passive check, or, to do them all in one go, by removing retention.dat with the following:

Code:

service nagios stop
rm -f /usr/local/nagios/var/retention.dat
service nagios start
The above will make all the checks go into a pending state until they receive their first check result.
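
For the passive-check route, a minimal sketch follows. It builds a `PROCESS_SERVICE_CHECK_RESULT` external command line with return code 0 (OK). The function name `passive_ok_command`, the host and service names, the output text, and the command file path are all assumptions for illustration; the real path is whatever `command_file` is set to in nagios.cfg, and the service must accept passive checks.

```shell
# Build an external command line that submits a passive OK result.
# Args: host_name service_description [epoch]
passive_ok_command() {
    # Use the supplied timestamp, or the current time if none is given
    now=${3:-$(date +%s)}
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;0;%s\n' \
        "$now" "$1" "$2" "OK - reset by passive check"
}

# Usage (hypothetical host/service; command file path is an assumption):
# passive_ok_command web01 HTTP > /usr/local/nagios/var/rw/nagios.cmd
```

This only resets one service at a time, which is why removing retention.dat is suggested above for resetting everything at once.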

Re: Retry Interval behaviour after upgrade to 4.4.1

Posted: Wed Aug 08, 2018 11:40 pm
by tonymcg27
Sorry Scott, but this still isn't working. Even after removing the retention.dat file and starting afresh, it's the same behaviour. Missing SOFT;2 records from nagios.log, and no recovery notifications.
I then rolled back to v4.3.4, but using the same files from etc and var, and it works just fine.
I have set up symlinks for the etc and var directories that point to a shared directory to make it easy to flip between versions, so I hope that's not mucking things up, i.e.

Code:

[root@nagios local]# ll -d /usr/local/nagios*
lrwxrwxrwx  1 root root         9 Aug  8 14:05 /usr/local/nagios -> nagios441B
drwxr-xr-x  7 root root      4096 Aug  2 11:54 /usr/local/nagios434
drwxr-xr-x 10 root root      4096 Sep  9  2014 /usr/local/nagios407
drwxr-xr-x  7 root root      4096 Aug  2 11:55 /usr/local/nagios441
drwxr-xr-x  7 root root      4096 Aug  3 09:15 /usr/local/nagios441B        <--- the "maint" release
drwxr-xr-x  4 root root      4096 Aug  2 11:46 /usr/local/nagioscommon

[root@nagios local]# ll  /usr/local/nagios441B
total 20
drwxrwxr-x  2 nagios nagios 4096 Aug  6 09:40 bin
lrwxrwxrwx  1 root   root     27 Aug  3 09:15 etc -> /usr/local/nagioscommon/etc
drwxr-xr-x  2 root   root   4096 Aug  3 09:14 include
drwxrwxr-x  2 nagios nagios 4096 Aug  3 09:14 libexec
drwxrwxr-x  2 nagios nagios 4096 Aug  6 09:40 sbin
drwxrwxr-x 15 nagios nagios 4096 Aug  6 09:40 share
lrwxrwxrwx  1 root   root     27 Aug  3 09:15 var -> /usr/local/nagioscommon/var
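
Flipping between versions with a layout like the one listed above could be sketched as follows. This is not the poster's actual procedure: the function name `switch_version` is hypothetical, and it only repoints the top-level symlink.

```shell
# Repoint a "current version" symlink at a different install directory.
# Args: base_dir link_name target_dir_name
switch_version() {
    base=$1; link=$2; target=$3
    # Refuse to point the link at an install that does not exist
    [ -d "$base/$target" ] || { echo "no such install: $base/$target" >&2; return 1; }
    # -n replaces the link itself rather than creating a new link
    # inside the directory it currently points to
    ln -sfn "$target" "$base/$link"
}

# Usage (directory names taken from the listing above):
# service nagios stop
# switch_version /usr/local nagios nagios434
# service nagios start
```

Because etc and var are themselves symlinks into nagioscommon, every version sees the same configuration and the same retention.dat after a switch.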

Re: Retry Interval behaviour after upgrade to 4.4.1

Posted: Thu Aug 09, 2018 8:10 am
by scottwilkerson
I hadn't caught this in the first change. One more commit was made to the maint branch this morning, and I have tested that it fixes the logging of SOFT states > 1.

Re: Retry Interval behaviour after upgrade to 4.4.1

Posted: Thu Aug 09, 2018 8:58 pm
by tonymcg27
Woohoo, it works!!!
Thanks Scott, for putting up with my nagging :)

Re: Retry Interval behaviour after upgrade to 4.4.1

Posted: Fri Aug 10, 2018 8:42 am
by scottwilkerson
tonymcg27 wrote: Woohoo, it works!!!
Thanks Scott, for putting up with my nagging :)
No problem, thanks for assisting in finding the bug!