nagios dies - sometimes

dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: nagios dies - sometimes

Post by dwhitfield »

Let us know how it goes!
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: nagios dies - sometimes

Post by MalcolmPreen »

OK... just when we thought we were out of the woods (no reproductions since xmas), we had a "day time" one today.

Nothing to do with thruk scheduled downtimes....

BUT - one of my colleagues had been working on a data centre, so approximately 116 hosts and 1100 services were in downtime.
The work had gone on for longer than planned, so the downtimes overlapped with each other [don't know if that matters, but it was also true for the thruk downtimes].

The first downtime was until 14:00 on 11th Jan, which was then extended to 18:00 on 11th Jan (the original downtime was not cancelled).
This was then extended to 14:00 on 12th Jan (today) - again overlapping, and no cancellation.

These downtimes were set programmatically: we send an e-mail identifying a host group and start/end time, and the programmatic downtime command is used. The command used is: SCHEDULE_HOSTGROUP_HOST_DOWNTIME;$target;$startnum;$endnum;1;0;$duration;$sender;$comment
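For anyone trying to reproduce this without the e-mail front end, here is a minimal sketch of how a command line with the same field order could be built and written to the external command file. The function names, the sample values and the command-file path are assumptions for illustration only; the real path is whatever command_file is set to in nagios.cfg.

```python
import time


def hostgroup_downtime_command(hostgroup, start, end, author, comment,
                               fixed=True, trigger_id=0, duration=None,
                               now=None):
    """Format one SCHEDULE_HOSTGROUP_HOST_DOWNTIME external command line.

    start and end are Unix timestamps (seconds). The field order mirrors
    the command quoted in the post: target;start;end;fixed;trigger;duration;
    author;comment.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    if duration is None:
        duration = end - start          # a fixed downtime spans the window
    if now is None:
        now = int(time.time())          # timestamp prefix of the command
    fields = [
        "SCHEDULE_HOSTGROUP_HOST_DOWNTIME",
        hostgroup,
        str(int(start)),
        str(int(end)),
        "1" if fixed else "0",
        str(int(trigger_id)),
        str(int(duration)),
        author,
        comment,
    ]
    return "[{}] {}\n".format(int(now), ";".join(fields))


def submit(line, command_file):
    """Append one command line to the Nagios external command file."""
    with open(command_file, "w") as handle:
        handle.write(line)


# Hypothetical usage (path assumed, not taken from the thread):
# submit(hostgroup_downtime_command("dc1", 1000, 4600, "malcolm", "planned work"),
#        "/usr/local/nagios/var/rw/nagios.cmd")
```

Sending the three overlapping windows described above through a helper like this, against a test instance with no Thruk installed, would be one way to check whether Core alone shows the problem.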
What happened today at 14:00 (the return of the hosts/services to regular operation) is that nagios died (as appears to happen overnight), and the last messages in /usr/local/nagios/var/nagios.log are multiple "host returning from downtime" messages.

All is now recovered, but this would appear to rule out thruk and rule in "downtime" as the cause.

I've seen failures when multiple downtimes are entered, or exited.

So, is there any reason why we should not have, for example:

host1 in downtime between 10:00 and 14:00, between 13:00 and 18:00, and between 17:00 and 14:00 (next day)
Ideally, this would be one downtime between 10:00 and 14:00 (next day), but hindsight is a wonderful thing!
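As a possible workaround, the overlapping windows could be collapsed into one before the downtime is submitted. The sketch below is a plain interval merge, not anything Nagios does itself; the function name is made up, and the year in the example is arbitrary since only the day and time matter.

```python
from datetime import datetime


def merge_downtimes(windows):
    """Merge overlapping or touching (start, end) windows.

    windows is an iterable of (start, end) pairs; the result is a sorted
    list of non-overlapping pairs covering the same time.
    """
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous window: extend that one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# The host1 example: three overlapping windows (year chosen arbitrarily).
example_windows = [
    (datetime(2017, 1, 11, 10, 0), datetime(2017, 1, 11, 14, 0)),
    (datetime(2017, 1, 11, 13, 0), datetime(2017, 1, 11, 18, 0)),
    (datetime(2017, 1, 11, 17, 0), datetime(2017, 1, 12, 14, 0)),
]
single_window = merge_downtimes(example_windows)
```

For the example, this yields a single window from 10:00 on the 11th to 14:00 on the 12th. It would only help if the earlier downtimes were also cancelled, which was not the case here.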
Any suggestions, bearing in mind this new information?
Malcolm
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: nagios dies - sometimes

Post by rkennedy »

At this point it doesn't seem like we're working with Nagios Core, but rather with Thruk. We do not have control over how Core interacts with other products. If you're looking for support with it, please contact them here - https://www.thruk.org/support.html

Our support is here to help with Nagios; it is fairly hard to support other products that aren't our code. Core is open source, so developers are able to do with it what they'd like, but we're limited on the support side.

If you are able to show us the bug on a clean Core system, without Thruk involved, then please post as much information as possible so we can attempt to re-create this on our end.
Former Nagios Employee
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: nagios dies - sometimes

Post by MalcolmPreen »

Did you read my recent note? Whilst thruk is installed (and can't be uninstalled, as people use it), the problem seems to be explicitly related to scheduled downtime (either the start or the end, and typically multiple downtimes, possibly overlapping).

Whilst I could set up a "non-thruk" system, reproducing the problem (over 100 hosts and over 1000 services) is going to take considerable effort. I'll see what I can do, but don't hold your breath.

I hope you will reconsider your response, as this feels to me (as a support engineer) like a "not my problem" response, which I hate.

If this is the stance to be taken by "nagios support", then fine, that's your call. At the end of the day, we are aware of the issue and can work around it, but as it appears that I have found a bug that potentially any relatively large set-up could encounter, I would have hoped that you would be interested in fixing the problem (which was the impression I got from dwhitfield).

For the record, I had opened a call with thruk, who took the same attitude: "it's probably nagios core"
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: nagios dies - sometimes

Post by tmcdonald »

We've not seen this behavior on other Core-only installs, regardless of the size of the installation. We also have not seen it in XI installs. The only differences you have noted are Thruk and (although we did not discuss it much) nsca-ng: https://github.com/weiss/nsca-ng

Neither of those is maintained by Nagios, and either could potentially be causing issues in your install.

If you can generate a core dump and send it along with the nagios binary (plus those of any modules in use) then I will take a look or have our Core dev look into it, but beyond that there is just not enough information to go on, considering we are not able to replicate this internally.
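Before waiting for the next crash, it may be worth checking that the running nagios process is allowed to write a core file at all. The sketch below assumes a Linux host, where per-process limits are exposed in /proc/<pid>/limits; the function names are made up for illustration, and other settings (such as the kernel's core_pattern) can still prevent a dump even when the limit is non-zero.

```python
def core_limit_from_proc_limits(text):
    """Return the (soft, hard) core file size limits as strings.

    text is the content of /proc/<pid>/limits. Returns None if the
    "Max core file size" row is missing.
    """
    label = "Max core file size"
    for line in text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None


def core_dumps_enabled(text):
    """True if the soft core limit is present and not zero."""
    limits = core_limit_from_proc_limits(text)
    return limits is not None and limits[0] != "0"


def read_limits(pid):
    """Read /proc/<pid>/limits for a running process (Linux only)."""
    with open("/proc/{}/limits".format(pid)) as handle:
        return handle.read()


# Hypothetical usage, with the PID of the running nagios daemon:
# print(core_dumps_enabled(read_limits(12345)))
```

If this reports that core dumps are disabled, the limit would need raising for the nagios daemon before a dump can be captured.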
Former Nagios employee