nagios dies - sometimes

dwhitfield · Post by **dwhitfield** » Mon Oct 10, 2016 10:23 am

No problem! We won't close the thread until there is a resolution.

MalcolmPreen · Post by **MalcolmPreen** » Mon Nov 07, 2016 10:45 am

OK - I've got a reproduction of the problem....

First good news, my detection of the failure worked.... so I was able to auto-start (having collected debug data).

The failure appears to happen at 00:50 - which matches every time I've looked before....

However, even in 50 minutes, the amount of data logged in /var/log/messages and /usr/local/nagios/var/nagios.log is huge....

I've ploughed through the logs, and whilst I don't have a definitive cause yet.... I do have a suspicion.

One of my colleagues installed thruk on this system - and a few people use this interface.

What I have found, is that there is an apache cron file... with "this is thruk - do not edit" all over it....

As I don't know thruk.... I've gone in with the sledgehammer - and edited the cron file and restarted crond

Basically it was putting 8 hosts into downtime.... I don't know why... and it was set-up in September 2015 by someone unknown (but I suspect it is the guy who left !!).

It does this every day at 00:50.... which rings a bell....

I'm guessing it might be an overload thing... as we get MANY nsca messages all the time (hence the logs being large)... and maybe this is why "sometimes" nagios dies...

The thruk entries were all sending output to /dev/null.... so I've disabled all but one.... and re-directed this output to a real file... so I can look into it.

Long term, I suspect I'll delete this completely.... but the problem is... you can't prove a negative.... so I'll have to wait at least a couple of months to confirm that this was the problem....

Any thoughts on thruk / nagios ?
Does the above diagnosis / plan make sense ?

More shortly,
Malcolm

dwhitfield · Post by **dwhitfield** » Mon Nov 07, 2016 11:52 am

Could you post (and scrub as appropriate) or PM the logs you find suspicious? I have no reason to doubt you, but since there is going to be a waiting game, we might as well go ahead and collect some data in case this ends up not being the issue.

Also, for what it's worth, Core 4.2.2 is now out.

MalcolmPreen · Post by **MalcolmPreen** » Tue Nov 08, 2016 11:42 am

OK, below I have attached three files;
.
in order to protect the anonymity of the customer, I have changed the host name to hostX from the full name... where X is a number from the original host name
.

obfuscated_apache.cron.txt: apache cron file (1.16 KiB)

- is the file /var/spool/cron/apache - which is generated / modified by thruk
.

obfuscated_normal_nagios.log: "normal" output in nagios.log (2.13 KiB)

- is the appropriate sections of nagios.log on a day when there was no failure
.

obfuscated_nagios.log: nagios.log file at the time of "failure" (861 Bytes)

- is the very end of the nagios.log file when nagios failed (there was no nagios -d daemon process running)
.
Until yesterday, I was unaware that it was possible to do this with thruk - and given that this was implemented in September 2015 - it is likely that it was done by a colleague who has now left the company. The reason this was done, I believe, was that these systems were rebooted at 01:00 each day - and this colleague wanted to stop the "host down" notifications - not realising that we had ways within nagios core to prevent notifications being sent.
.
What I have done at present, is to remove all but one of these entries.... which is only being kept as a "placeholder" whilst I investigate further;
.
I am on holiday at the end of this week, but my plan is to "flood" the system early next week - with perhaps 100s of entries of thruk downtime - to give a better chance of "forcing" the error - as I believe it happens only if the "downtime" clashes with a heavy load of NSCA messages - which are unpredictable.
.
If I can "force" the error - then perhaps more diagnostics can be run....
- I would love to get an error from nagios (why did it terminate??)
.
I have the nagios.debug file from the time of the failure - but this shows no information about the failure (settings for debug as advised earlier in this thread).
.
In the long term.... I "think" I can avoid the problem... by disabling the thruk downtime.... but as before.... when can I be sure this is true?
.
This (running nagios with multiple nsca inputs and thruk... along with thruk scheduled downtime) probably explains why no-one else has seen this...
.
So, if I can "force" the error.... is there any way of finding out why / how the nagios core daemon has died ?

avandemore · Post by **avandemore** » Tue Nov 08, 2016 11:56 am

We do not support modified versions of Nagios, be it from thruk or whatever.

This branch of NSCA has some fixes for high load scenarios.

https://github.com/NagiosEnterprises/ns ... a-2-9-2RC1

MalcolmPreen · Post by **MalcolmPreen** » Wed Nov 09, 2016 10:13 am

nagios core is not modified

we currently use 4.2.1 - but have plans to upgrade to 4.2.2 - hopefully next week

I will also investigate upgrading nsca - but before I do that, I need to prove (or otherwise) that I can reproduce the problem.

I have not had a lot of involvement with thruk - although my understanding is that this is just an additional interface from which to view the data within nagios core

As detailed above, it appears that it is possible within thruk to perform tasks (such as scheduling downtime for a host) - see previously attached cron file

What I don't understand is why nagios has died / been killed - but there is no record of this (that I can find). Is there something that can be set within nagios, or nagios.cfg to force something to be logged ?

At present, I'm making educated guesses - which isn't ideal.

Thanks, Malcolm

avandemore · Post by **avandemore** » Wed Nov 09, 2016 12:37 pm

My statement wasn't because I suspected you had modified Nagios Core proper, it's the environment it runs in.

That type of detail isn't generally recorded in a daemon's own logs unless it is a cleanly invoked shutdown. However, you can use Linux's auditd to track system-level events such as this:

https://www.digitalocean.com/community/ ... n-centos-7
https://linux-audit.com/configuring-and ... it-daemon/

Also rather than an upgrade, a migration to a clean and newer system is probably a better option.

MalcolmPreen · Post by **MalcolmPreen** » Fri Nov 18, 2016 10:03 am

Update:
.
We've had two stable weeks (not the longest ever - but I was too busy to attempt a reproduction).
.
I had removed all but one of the thruk scheduled downtime activities (previously there were 8).
.
Yesterday, I edited the /var/spool/cron/apache file - which thruk uses to schedule its downtime jobs (and restarted crond)
.
I copied the single entry (scheduled to run at 00:20) and as a first test, I set-up;
-
10 jobs at 00:20
10 jobs at 00:21
10 jobs at 00:22
10 jobs at 00:23
.
As hoped, nagios failed as it had before.... this time, the failure was at 00:22 - and from the log files, I can see that it had attempted 25 downtime requests.
.
To be 100% certain - I'll need to perform this task a few more times - but I've got a good feeling that it is related to the number of jobs being launched at the same time - and possibly something to do with the thruk/nagios interface.
.
Given that I can force the error using the thruk interface to nagios - my immediate suspicion is that the problem is not with nagios.... but potentially with the thruk interface (ie out of scope for this forum).
.
I will continue my investigations - and ideally use the suggestions from earlier in this thread to try to understand how / why nagios is failing.
.
However, my guess is that the best I can hope for is that there may be a way to prevent nagios from dying in the future.
.
As the problem is related to the thruk / nagios interface - should I continue to post updates here ? (or should I just post a final update.... assuming I get one ?). Please advise.
.
Thanks for listening,
.
Malcolm
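
For anyone wanting to repeat the flood test described above, the crontab lines can be generated rather than copied by hand. This is only a sketch: the command string is a hypothetical placeholder (the real entry is whatever thruk wrote into /var/spool/cron/apache), and the log path is an assumption chosen so that output lands in a real file instead of /dev/null.

```python
def flood_entries(command, hour=0, start_minute=20, minutes=4,
                  jobs_per_minute=10, log_path="/tmp/downtime_flood.log"):
    """Build user-crontab lines: jobs_per_minute copies of command per minute."""
    if not 0 <= start_minute <= 59 or start_minute + minutes > 60:
        raise ValueError("minute range must stay within one hour")
    lines = []
    for minute in range(start_minute, start_minute + minutes):
        for _ in range(jobs_per_minute):
            # Five time fields, then the command; output is appended to
            # log_path rather than discarded.
            lines.append(f"{minute} {hour} * * * {command} >> {log_path} 2>&1")
    return lines


# Hypothetical placeholder, NOT a real thruk command line.
PLACEHOLDER_COMMAND = "/path/to/copied-downtime-command"

entries = flood_entries(PLACEHOLDER_COMMAND)
print(len(entries))   # 4 minutes x 10 jobs
print(entries[0])
```

With the defaults this gives the same shape as the test above: 10 jobs at each of 00:20, 00:21, 00:22 and 00:23.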

avandemore · Post by **avandemore** » Fri Nov 18, 2016 11:45 am

You can post here, but it's really going to be more relevant at thruk's support. As stated, we can't really help you much with the presence of thruk, but you can use something like daemontools to automatically restart nagios if it dies. However, I would only consider that a temporary workaround.
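
As an illustration of the "restart it if it dies" idea (not a replacement for daemontools or a proper init system), a minimal polling watchdog might look like the sketch below. It assumes a Linux /proc layout where /proc/<pid>/comm holds the process name; the function names and the restart command shown in the usage note are hypothetical and would need adjusting for the local system.

```python
import os
import subprocess
import time


def is_running(name, proc_root="/proc"):
    """Return True if any PID directory under proc_root reports `name` in comm."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are PIDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    return True
        except OSError:
            continue  # process went away while we were scanning
    return False


def watch(name, restart_cmd, interval=60, checks=None, proc_root="/proc"):
    """Poll for `name` and run restart_cmd whenever it is missing.

    checks limits the number of polls (None means poll forever).
    Returns the number of restart attempts made.
    """
    restarts = 0
    done = 0
    while checks is None or done < checks:
        if not is_running(name, proc_root):
            subprocess.call(restart_cmd)
            restarts += 1
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)
    return restarts
```

A call such as `watch("nagios", ["service", "nagios", "restart"])` would poll once a minute; the restart command there is an assumption. As noted above, this only masks the crash and is best treated as a stopgap.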

MalcolmPreen · Post by **MalcolmPreen** » Fri Nov 18, 2016 12:09 pm

As promised, I started looking into setting up auditd;
.
It was already installed and running, and checking through the audit log file I found;
.
type=ANOM_ABEND msg=audit(1479428524.692:2057333): auid=501 uid=501 gid=501 ses=451278 pid=13887 comm="nagios" sig=11
[uid and gid 501 = nagios]
.
The type ANOM_ABEND is described as;
.
Triggered when a process ends abnormally (with a signal that could cause a core dump, if enabled).
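
The audit record above can be decoded with a few lines of Python. This is a sketch that assumes the key=value layout shown in that one line; `parse_audit_line` is a hypothetical helper, not part of any audit tooling. The number before the dot in `audit(...)` is seconds since the epoch, and it decodes to 2016-11-18 00:22:04 UTC, which lines up with the 00:22 failure; signal 11 on Linux is SIGSEGV, i.e. a segmentation fault.

```python
import re
import signal
from datetime import datetime, timezone

AUDIT_LINE = ('type=ANOM_ABEND msg=audit(1479428524.692:2057333): auid=501 '
              'uid=501 gid=501 ses=451278 pid=13887 comm="nagios" sig=11')


def parse_audit_line(line):
    """Split an audit record into fields and decode timestamp and signal."""
    fields = {key: value.strip('"')
              for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', line)}
    stamp = re.search(r'audit\((\d+)\.(\d+):(\d+)\)', fields["msg"])
    sig = int(fields["sig"])
    try:
        sig_name = signal.Signals(sig).name
    except ValueError:
        sig_name = f"signal {sig}"  # number not known on this platform
    return {
        "type": fields["type"],
        "time": datetime.fromtimestamp(int(stamp.group(1)), tz=timezone.utc),
        "event_id": int(stamp.group(3)),
        "pid": int(fields["pid"]),
        "comm": fields["comm"],
        "signal": sig_name,
    }


record = parse_audit_line(AUDIT_LINE)
print(record["time"].isoformat(), record["comm"], record["signal"])
```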
.
By default, cores are disabled.... so I have enabled cores, in preparation for the next reproduction;
.
Malcolm
.
FWIW, I wouldn't expect nagios to dump core, even due to external "forces" - so hopefully anything I can uncover will be helpful ?
Or is my expectation invalid?
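
The post does not say how cores were enabled; the usual route is `ulimit -c unlimited` in the environment that starts the daemon. Purely as an illustration, the same limit can be raised from Python for a process and anything it launches. `enable_core_dumps` is a hypothetical helper name, and whether a core file actually appears also depends on kernel settings that this sketch does not touch.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit.

    Applies to the calling process and to children started afterwards.
    Returns the (soft, hard) limits now in effect.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


soft, hard = enable_core_dumps()
print("core limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)
```

If the hard limit itself is 0, the soft limit cannot be raised without root, and the printed value will show that.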

Nagios Support Forum