nagios dies - sometimes

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: nagios dies - sometimes

Post by dwhitfield »

Fingers crossed!

Also, I believe there should be another Core update coming later this week. Just keep an eye on your install's landing page (assuming you haven't modified it). It should let you know when there is an update.
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Post by MalcolmPreen »

4.2.3 is going through testing. The current plan is to install it early next week.

Post by dwhitfield »

Awesome, we'll keep it open. :)

Post by MalcolmPreen »

Interesting situation last night. We are still running Nagios Core 4.2.2, and a similar failure occurred.

But this time the 80 cron jobs started OK (10 per minute for 8 minutes):

Code: Select all

20  0  *  *  * cd /usr/share/thruk && /bin/bash -l -c '/usr/bin/thruk -a downtimetask="hst_hostnameinsertedhere"' >/dev/null 2>>/var/lib/thruk/cron.log
The failure occurred approximately an hour later.

The last entry in /usr/local/nagios/var/nagios.log before nagios was restarted was:

Code: Select all

[1481246456] PASSIVE SERVICE CHECK: host;service;output
However, the first occurrence of the string 1481246456 in /usr/local/nagios/var/nagios.log is:

Code: Select all

[1481242857] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;hostnameinsertedhere;1481242856;1481246456;1;0;0;(cron);automatic downtime

Code: Select all

1481242856 equates to Dec 09 2016 @ 00:20:56 [when the downtime started]
1481242857 equates to Dec 09 2016 @ 00:20:57 [when the cron job was launched]
1481246456 equates to Dec 09 2016 @ 01:20:56 [when the downtime ended]
So the problem is different, but the timing still seems related to the SCHEDULE_HOST_DOWNTIME commands (which were set up using Thruk). This time it happened when the downtime ended, rather than when it started.
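
The timestamp arithmetic above can be double-checked with a few lines of Python. This is only a sketch: the helper names `utc` and `parse_downtime` are made up for illustration, the field order is the one documented for the SCHEDULE_HOST_DOWNTIME external command, and it assumes the times quoted above are UTC (which is what the figures work out to).

```python
from datetime import datetime, timezone

# Field order of the Nagios external command SCHEDULE_HOST_DOWNTIME:
# host_name;start_time;end_time;fixed;trigger_id;duration;author;comment
FIELDS = ("host_name", "start_time", "end_time", "fixed",
          "trigger_id", "duration", "author", "comment")


def utc(epoch):
    """Format a Unix timestamp as a UTC date and time."""
    moment = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def parse_downtime(line):
    """Split a nagios.log SCHEDULE_HOST_DOWNTIME entry into named fields."""
    stamp, rest = line.split("] ", 1)
    command, args = rest.split(": ", 1)[1].split(";", 1)
    if command != "SCHEDULE_HOST_DOWNTIME":
        raise ValueError("not a SCHEDULE_HOST_DOWNTIME entry")
    # Limit the split so a comment containing ';' stays in one piece.
    entry = dict(zip(FIELDS, args.split(";", len(FIELDS) - 1)))
    entry["logged"] = stamp.lstrip("[")
    return entry


# The log line quoted above, unchanged.
line = ("[1481242857] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;"
        "hostnameinsertedhere;1481242856;1481246456;1;0;0;(cron);automatic downtime")
entry = parse_downtime(line)
for key in ("start_time", "logged", "end_time"):
    print(key, entry[key], utc(entry[key]))
```

The end time is exactly 3600 seconds after the start time, which fits the roughly one-hour gap between the cron jobs and the failure.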

No core file was recorded. Based on my previous tests, either one wasn't generated, or the process had no write permission to the home directory when it was started.

I've restarted Nagios from /tmp "just in case", but is there a way I can ensure that the nagios user (which is the process owner) can always write to the directory?
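
As a rough way of checking the two conditions mentioned here, the sketch below reports whether the current user can write to a directory and what the core file size limit is. It does not change anything, and the function name `core_dump_prereqs` is invented for this example. It assumes a Linux system where the core file lands in the working directory of the process; settings such as kernel.core_pattern or fs.suid_dumpable can also affect this and are not checked.

```python
import os
import resource


def core_dump_prereqs(directory):
    """Report whether the current process could write a core file into `directory`."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    return {
        "directory": directory,
        # Writing a new file needs write and search permission on the directory.
        "writable": os.access(directory, os.W_OK | os.X_OK),
        "core_soft_limit": soft,  # 0 means core files are disabled
        "core_hard_limit": hard,
    }


if __name__ == "__main__":
    # Run this as the nagios user, from the directory nagios is started in.
    print(core_dump_prereqs(os.getcwd()))
```

A soft limit of 0 would mean no core file is written regardless of directory permissions.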

Is there a way to find out how Nagios died? I can see nothing in /var/log/audit/audit.log.
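
One generic approach, sketched below, is to have whatever restarts the process record how it ended. This only works if the monitored process is a direct child of the wrapper (not daemonised), which is an assumption here and not something confirmed for this set-up. The function names `describe_exit` and `run_and_report` are hypothetical.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode < 0:
        # subprocess reports death by signal N as a return code of -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal %d" % -returncode
        return "killed by %s" % name
    return "exited with status %d" % returncode


def run_and_report(argv):
    """Run argv as a direct child process and describe how it ended."""
    return describe_exit(subprocess.call(argv))


print(describe_exit(-signal.SIGSEGV))  # killed by SIGSEGV
```

Logging that description alongside a timestamp would at least distinguish a crash (for example SIGSEGV or SIGABRT) from a deliberate kill or a normal exit.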

Any input appreciated, Malcolm

Post by dwhitfield »

I did a little digging, and it's not at all obvious which versions of Nagios Thruk supports (all I could find is 4.x). The email address we have associated with it on Exchange is sven@nierlein.de. That is probably a better contact for this issue. Please let us know if you are unable to contact them through email.

Thruk just released a new stable version on Nov 28, so maybe you could try updating Thruk. They have repos at https://www.thruk.org/download.html

To clarify, you can certainly schedule downtime without Thruk.

Post by MalcolmPreen »

I'm still planning to investigate Thruk, but just for information, we had a repeat on Nagios Core 4.2.3 (upgrading to 4.2.4 tomorrow).

I'm well aware that we can schedule downtime without Thruk, which is why the investigation needs to head in that direction. I'm keeping it in place for now so I don't forget. As we have a set-up that collects debug information and restarts Nagios, and the failures happen overnight, there is no impact on the server.

I fully expect the 4.2.4 upgrade to make no difference, and hope to get a chance to investigate thruk updates (or deletion !!) over the holidays.

Malcolm

Post by dwhitfield »

MalcolmPreen wrote: hope to get a chance to investigate thruk updates (or deletion !!) over the holidays.
It'd be great if the upgrade on their side fixes things because I know you aren't the only person using it. Please let us know if you have any questions about removal, and at the very least, we will be able to help with migration.

Post by MalcolmPreen »

OK, the current status: Nagios Core is now 4.2.4.

Over the Christmas holidays we had a pair of failures.

So, as discussed, I'm investigating updating Thruk.

We are currently running 1.80-3, and 2.12-3 is available.

I've downloaded all of the available rpms:

Code: Select all

  4963208 Dec 28 13:56 libthruk-2.10-1.rhel6.i686.rpm
    2316 Dec 28 13:42 thruk-2.12-3.rhel6.i686.rpm
  5424012 Dec 28 13:56 thruk-base-2.12-3.rhel6.i686.rpm
 21429336 Dec 28 13:56 thruk-plugin-reporting-2.12-3.rhel6.i686.rpm
But if I try to install them, I get the following:

Code: Select all

# yum install *
Loaded plugins: fastestmirror, security
Loading mirror speeds from cached hostfile
 * base: centos.serverspace.co.uk
 * epel: mirror.bytemark.co.uk
 * extras: mirror.sov.uk.goscomb.net
 * rpmforge: repoforge.mirror.wearetriple.com
 * updates: mirror.sov.uk.goscomb.net
Setting up Install Process
Examining libthruk-2.10-1.rhel6.i686.rpm: libthruk-2.10-1.el6.i686
Marking libthruk-2.10-1.rhel6.i686.rpm to be installed
Examining thruk-2.12-3.rhel6.i686.rpm: thruk-2.12-3.i686
Marking thruk-2.12-3.rhel6.i686.rpm as an update to thruk-1.80-3.x86_64
Examining thruk-base-2.12-3.rhel6.i686.rpm: thruk-base-2.12-3.i686
Marking thruk-base-2.12-3.rhel6.i686.rpm to be installed
Examining thruk-plugin-reporting-2.12-3.rhel6.i686.rpm: thruk-plugin-reporting-2.12-3.i686
Marking thruk-plugin-reporting-2.12-3.rhel6.i686.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package libthruk.i686 0:2.10-1.el6 set to be updated
---> Package thruk.i686 0:2.12-3 set to be updated
---> Package thruk-base.i686 0:2.12-3 set to be updated
--> Processing Dependency: cronie for package: thruk-base
---> Package thruk-plugin-reporting.i686 0:2.12-3 set to be updated
--> Finished Dependency Resolution
thruk-base-2.12-3.i686 from /thruk-base-2.12-3.rhel6.i686 has depsolving problems
  --> Missing Dependency: cronie is needed by package thruk-base-2.12-3.i686 (/thruk-base-2.12-3.rhel6.i686)
Error: Missing Dependency: cronie is needed by package thruk-base-2.12-3.i686 (/thruk-base-2.12-3.rhel6.i686)
 You could try using --skip-broken to work around the problem
 You could try running: package-cleanup --problems
                        package-cleanup --dupes
                        rpm -Va --nofiles --nodigest
So I need the cronie rpm. Today, everywhere I've looked for a CentOS version has turned up nothing.

I've downloaded the source package, but even if I build and install that, I'm not convinced it would resolve the dependency above. I could use --skip-broken, but given that the original problem is related to "cron" type jobs, I'm not sure I should go that way.

The system already has vixie-cron installed:

Code: Select all

# rpm -qa|grep cron
vixie-cron-4.1-81.el5
anacron-2.3-45.el5.centos
crontabs-1.10-11.el5
So, should I build the source package and proceed?
Or do I need to query this with the thruk team ?
Or something else?

Any suggestions?
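
A possible lead, going only by the output quoted above: the downloaded rpms are tagged rhel6 and i686, while yum refers to the installed thruk as x86_64 and the installed cron packages are tagged el5. cronie is the cron package that replaced vixie-cron in EL6, which might explain why no EL5 build turns up. The sketch below simply pulls the release and architecture tags out of the package names listed above; the regular expressions are a simplification and the helper names are made up.

```python
import re

# Simplified patterns for the dist tag (el5, rhel6, ...) and the architecture.
DIST = re.compile(r"\.((?:rh)?el)(\d+)\b")
ARCH = re.compile(r"\.(i[3-6]86|x86_64|noarch)\b")


def dist_and_arch(package):
    """Return (EL major release or None, architecture or None) from an rpm name."""
    dist = DIST.search(package)
    arch = ARCH.search(package)
    return (int(dist.group(2)) if dist else None,
            arch.group(1) if arch else None)


def summarize(packages):
    """Collect the distinct EL releases and architectures seen in a list of names."""
    pairs = [dist_and_arch(p) for p in packages]
    return ({d for d, _ in pairs if d is not None},
            {a for _, a in pairs if a is not None})


# Names copied from the directory listing and the yum / rpm output above.
downloaded = [
    "libthruk-2.10-1.rhel6.i686.rpm",
    "thruk-2.12-3.rhel6.i686.rpm",
    "thruk-base-2.12-3.rhel6.i686.rpm",
    "thruk-plugin-reporting-2.12-3.rhel6.i686.rpm",
]
installed = [
    "thruk-1.80-3.x86_64",
    "vixie-cron-4.1-81.el5",
    "anacron-2.3-45.el5.centos",
    "crontabs-1.10-11.el5",
]

print("downloaded:", summarize(downloaded))
print("installed: ", summarize(installed))
```

If the two summaries disagree, as they appear to here, it may be worth looking for packages built for the installed release and architecture before building cronie from source.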

Post by dwhitfield »

MalcolmPreen wrote: Or do I need to query this with the thruk team ?
This, or open a new thread. I doubt community members are going to see "nagios dies" in the subject and think, "I can help this guy with thruk."

Post by MalcolmPreen »

Thanks. I'm proceeding down that line (contacting the Thruk team directly).

Their website suggests updating using the ConSol Labs repository.

Having ironed out a couple of local network routing issues, I've added the "stable" repository and have upgraded Thruk from 1.80 to 2.00.

I'll try running with that, and if the issue reproduces I will raise a ticket with them (with the option to upgrade to the "testing" repository).

Thanks again for your input...