nagios dies - sometimes

dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN
Contact:

Re: nagios dies - sometimes

Post by dwhitfield »

Did you end up upgrading to 4.2.2? If we end up finding a bug, I just want to make sure it is in the latest version. Thanks!
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: nagios dies - sometimes

Post by MalcolmPreen »

I wanted to prove I could reproduce it at will before upgrading.

The plan at present is:

- reproduce again tonight
- upgrade tomorrow
- attempt to reproduce tomorrow
- report back

Hope that makes sense,

Malcolm
dwhitfield

Re: nagios dies - sometimes

Post by dwhitfield »

Sounds good. Please be aware that Thursday is Thanksgiving in the US, and we will be closing early on Wednesday, and will not be open Thursday or Friday. Of course, the community can still contribute on the forums.
MalcolmPreen

Re: nagios dies - sometimes

Post by MalcolmPreen »

Update:

- no reproduction on Nagios 4.2.1 on Tuesday 22nd
- I needed to upgrade to 4.2.2 on Tuesday, or it would have been next month
- after the upgrade, no reproduction as yet

I've increased the attempted reproduction from
4 lots of 10 attempts (40 total over 4 minutes) to
8 lots of 10 attempts (80 total over 8 minutes) to
8 lots of 20 attempts (160 total over 8 minutes).

Currently no reproduction.

So it MIGHT be fixed, but it might also be a timing issue.

I've left it at 8*20 over the weekend and will re-check on Monday.
dwhitfield

Re: nagios dies - sometimes

Post by dwhitfield »

I think it's at least Monday everywhere in the world. How are things looking?
MalcolmPreen

Re: nagios dies - sometimes

Post by MalcolmPreen »

Still failing to reproduce, despite increasing the "attempts" to 40 a minute for 8 minutes...

I'll leave this in place, and continue to monitor for a month or so...

If it makes it to the New Year, I suspect we can say that the nagios update has fixed it....

Malcolm
dwhitfield

Re: nagios dies - sometimes

Post by dwhitfield »

I know it's technically incomplete news, but it sounds like good news to me. We'll keep the thread open and await news in 2017!
MalcolmPreen

Re: nagios dies - sometimes

Post by MalcolmPreen »

I'd agree.... but.... we had a failure again last night...

It was slightly different, in that the "bigger hammer" resulted in the whole system dying. Checking the logs, though, I see only 13 "set downtime" commands out of the 320 scheduled, so I suspect it is related.

So, I've reduced the hammer, back to 10 per minute for 8 minutes... and I need to be patient for a "clean" failure (or a non-failure)

Sorry for the less than positive news....

Still action with me...
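
For anyone following along, here is a minimal sketch of a "hammer" like the one described above, together with a way to compare what was sent against what Nagios logged. It is not Malcolm's actual script. It assumes the downtimes are submitted through the external command file as SCHEDULE_HOST_DOWNTIME commands, that external commands are being logged, and that the command file is at the default source-install path. The host name, author and comment are placeholders.

```shell
#!/bin/sh
# Assumed default path for a source install; adjust to match command_file in nagios.cfg.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# Print one SCHEDULE_HOST_DOWNTIME external command line.
# Args: now host start end author comment
# Uses fixed=1 and trigger_id=0; duration is end - start in seconds.
downtime_cmd() {
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$(( $4 - $3 ))" "$5" "$6"
}

# Append COUNT downtime commands for HOST to TARGET (normally $CMD_FILE).
# Each downtime starts now and lasts 600 seconds (an arbitrary choice).
send_batch() {
    count=$1; host=$2; target=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        now=$(date +%s)
        downtime_cmd "$now" "$host" "$now" "$(( now + 600 ))" \
            "tester" "repro attempt $i" >> "$target"
        i=$(( i + 1 ))
    done
}

# Count the downtime commands that made it into a Nagios log file.
count_logged() {
    grep -c 'EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;' "$1" || true
}
```

Calling `send_batch 10 somehost "$CMD_FILE"` once a minute for 8 minutes would give the reduced 10-per-minute load, and `count_logged` on the log afterwards shows how many of those commands Nagios actually recorded.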
dwhitfield

Re: nagios dies - sometimes

Post by dwhitfield »

While we wait for the error, I'm also going to send this thread to our Core dev to see if he thinks there are any logs or traces he wants us to get in case there is a bug here.

UPDATE: John suggested getting a core (lowercase c) dump. How to do this depends on a few factors, but if you don't know how to do it, you may find https://stackoverflow.com/questions/179 ... tion-fault useful.
MalcolmPreen

Re: nagios dies - sometimes

Post by MalcolmPreen »

OK, I've set up core file generation and tested it by running:

sleep 300 &
kill -ABRT [pid of sleep above]

which creates a core.[pid] file.

Fingers crossed this will help if we need it.

Malcolm
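
The same check can be wrapped in a small script. This is a sketch only. It assumes the core size limit can be raised with ulimit in the current shell and that core files land in the working directory as core.[pid] or core. On Linux that depends on kernel.core_pattern, so the file may end up elsewhere or be piped to a crash handler instead.

```shell
#!/bin/sh
# Abort a throwaway process, record how it died, and look for a core file.
# Sets abort_status to the exit status reported by wait.
abort_test() {
    # Try to allow core files of any size for this shell and its children.
    ulimit -c unlimited 2>/dev/null || echo "could not raise core limit" >&2

    sleep 300 &
    pid=$!
    kill -ABRT "$pid"

    # A process killed by SIGABRT (signal 6) is reported as 128 + 6 = 134.
    abort_status=0
    wait "$pid" || abort_status=$?
    echo "exit status: $abort_status"

    # Assumed core file names; the real name depends on system settings.
    for f in "core.$pid" core; do
        if [ -e "$f" ]; then
            echo "core file found: $f"
        fi
    done
    return 0
}
```

Running `abort_test` should print an exit status of 134. If no core file is reported, the limit or the core pattern for that environment probably needs another look, and the same settings have to apply to the environment the nagios daemon itself runs in, not just an interactive shell.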
Locked