nagios dies - sometimes
-
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
Re: nagios dies - sometimes
Fingers crossed!
Also, I believe there should be another Core update coming later this week. Just keep an eye on your install's landing page (assuming you haven't modified it). It should let you know when an update is available.
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Re: nagios dies - sometimes
4.2.3 is going through testing... current plan is to install early next week....
-
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
Re: nagios dies - sometimes
Awesome, we'll keep it open.
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Re: nagios dies - sometimes
Interesting situation last night. Still running Nagios Core 4.2.2, and a similar failure occurred.
But this time, the 80 cron jobs started OK (10 per minute for 8 minutes):
Code: Select all
20 0 * * * cd /usr/share/thruk && /bin/bash -l -c '/usr/bin/thruk -a downtimetask="hst_hostnameinsertedhere"' >/dev/null 2>>/var/lib/thruk/cron.log
And the failure occurred approximately an hour later.
The last entry in /usr/local/nagios/var/nagios.log before nagios was restarted was:
Code: Select all
[1481246456] PASSIVE SERVICE CHECK: host;service;output
But the first occurrence of the string 1481246456 in /usr/local/nagios/var/nagios.log is:
Code: Select all
[1481242857] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;hostnameinsertedhere;1481242856;1481246456;1;0;0;(cron);automatic downtime
Code: Select all
1481242856 equates to Dec 09 2016 @ 00:20:56 [when the downtime started]
1481242857 equates to Dec 09 2016 @ 00:20:57 [when the cron job was launched]
1481246456 equates to Dec 09 2016 @ 01:20:56 [when the downtime ended]
So the problem is different, but the timings still seem related to the SCHEDULE_HOST_DOWNTIME commands (which were set up using Thruk). This time it happens when the downtime ends, rather than when it starts.
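For anyone who wants to double-check the epoch values quoted in this post, here is a minimal sketch. It assumes GNU `date` (for the `-d @N` syntax) and prints UTC; the function name `epoch_to_utc` is only illustrative. The times quoted in the post agree with UTC.

```shell
#!/bin/sh
# Convert an epoch timestamp to a UTC date string.
# Assumes GNU date; LC_ALL=C keeps the month abbreviation in English.
epoch_to_utc() {
    LC_ALL=C date -u -d "@$1" '+%b %d %Y @ %H:%M:%S'
}

# The three timestamps that appear in the log entries from the post
for ts in 1481242856 1481242857 1481246456; do
    printf '%s equates to %s\n' "$ts" "$(epoch_to_utc "$ts")"
done

# Length of the scheduled downtime window (end minus start)
echo "downtime length: $((1481246456 - 1481242856)) seconds"
```

The window works out to 3600 seconds, which matches the observation that the failure came about an hour after the cron jobs ran.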
There is no core file recorded. Based on my previous tests, either one wasn't generated, or there was no write permission to the home directory when the process was started.
I've restarted nagios from /tmp "just in case", but is there a way I can ensure that the user nagios (which is the process owner) can always write to the directory?
Is there a way to find out how nagios died? (There is nothing that I can see in /var/log/audit/audit.log.)
Any input appreciated, Malcolm
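The two suspected reasons for the missing core file (none generated, or no write permission where the process started) can be checked with a short script. This is a general Linux sketch rather than a Nagios-specific procedure: the function names are made up for illustration, and it assumes it is run as the nagios user from the directory the daemon is started in.

```shell
#!/bin/bash
# Report the two preconditions for a core file discussed in the post.
# Assumption: run as the nagios user, from the daemon's start directory.

core_limit_status() {
    # $1: value printed by `ulimit -c` ("0", a block count, or "unlimited")
    if [ "$1" = "0" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

dir_write_status() {
    # $1: directory to test for write permission by the current user
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "writable"
    else
        echo "not writable"
    fi
}

echo "core dumps: $(core_limit_status "$(ulimit -c)") (ulimit -c = $(ulimit -c))"
echo "start directory $PWD: $(dir_write_status "$PWD")"

# On Linux the kernel may send core files elsewhere; show the pattern if readable.
if [ -r /proc/sys/kernel/core_pattern ]; then
    echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
fi
```

A limit of 0 means no core file is written regardless of directory permissions, so that is worth ruling out before changing where nagios is started from.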
-
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
Re: nagios dies - sometimes
I did a little digging, and it's not at all obvious which versions of Nagios Thruk supports (all I could find is 4.x). The email we have associated with it on Exchange is sven@nierlein.de; that is probably a better contact for this issue. Please let us know if you are unable to reach them by email.
Thruk just released a new stable version on Nov 28, so maybe you could try updating Thruk. They have repos at https://www.thruk.org/download.html
To clarify, you can certainly schedule downtime without Thruk.
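As a rough illustration of scheduling downtime without Thruk, the sketch below prints a SCHEDULE_HOST_DOWNTIME line in the same format as the entry logged earlier in this thread. The function name `downtime_cmd` and its argument order are invented for this example, and the command file path in the comment is an assumption based on the /usr/local/nagios prefix seen above; check `command_file` in nagios.cfg for the real location.

```shell
#!/bin/sh
# Print one SCHEDULE_HOST_DOWNTIME line in Nagios external-command format.
# Arguments: now host start length_seconds author comment
# The fixed values 1;0;0 (fixed downtime, no trigger, duration 0) copy the
# entry logged in this thread; other downtime types need other values.
downtime_cmd() {
    end=$(( $3 + $4 ))
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;0;%s;%s\n' \
        "$1" "$2" "$3" "$end" "$5" "$6"
}

# Reproduce the entry seen in nagios.log
downtime_cmd 1481242857 hostnameinsertedhere 1481242856 3600 "(cron)" "automatic downtime"

# To submit for real, redirect into the external command file instead, e.g.
#   downtime_cmd "$(date +%s)" myhost "$(date +%s)" 3600 "(cron)" "automatic downtime" \
#       > /usr/local/nagios/var/rw/nagios.cmd    # path is an assumption
```

Running the same schedule this way for a few nights would be one way to see whether the failures follow the downtime itself or the tool that submits it.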
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Re: nagios dies - sometimes
Still planning to investigate thruk.... but just for information... we had a repeat on nagios core 4.2.3 (upgrading to 4.2.4 tomorrow)
Well aware that we can schedule downtime without thruk.... which is why the investigation needs to head that direction.... but keeping it in place so I don't forget... as we have a set-up that collects debug... and re-starts... and it is overnight... there is no impact to the server.
Fully expecting 4.2.4 upgrade to make no difference, and hope to get a chance to investigate thruk updates (or deletion !!) over the holidays.
Malcolm
-
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
Re: nagios dies - sometimes
MalcolmPreen wrote: hope to get a chance to investigate thruk updates (or deletion !!) over the holidays.
It'd be great if the upgrade on their side fixes things, because I know you aren't the only person using it. Please let us know if you have any questions about removal; at the very least, we will be able to help with migration.
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Re: nagios dies - sometimes
OK, the current status: Nagios Core is now 4.2.4.
Over the Christmas holidays we had a pair of failures.
So, as discussed, I'm investigating updating Thruk.
We are currently running 1.80-3, and 2.12-3 is available.
I've downloaded all of the available rpms:
Code: Select all
4963208 Dec 28 13:56 libthruk-2.10-1.rhel6.i686.rpm
2316 Dec 28 13:42 thruk-2.12-3.rhel6.i686.rpm
5424012 Dec 28 13:56 thruk-base-2.12-3.rhel6.i686.rpm
21429336 Dec 28 13:56 thruk-plugin-reporting-2.12-3.rhel6.i686.rpm
But if I try to install them, I get the following:
Code: Select all
# yum install *
Loaded plugins: fastestmirror, security
Loading mirror speeds from cached hostfile
* base: centos.serverspace.co.uk
* epel: mirror.bytemark.co.uk
* extras: mirror.sov.uk.goscomb.net
* rpmforge: repoforge.mirror.wearetriple.com
* updates: mirror.sov.uk.goscomb.net
Setting up Install Process
Examining libthruk-2.10-1.rhel6.i686.rpm: libthruk-2.10-1.el6.i686
Marking libthruk-2.10-1.rhel6.i686.rpm to be installed
Examining thruk-2.12-3.rhel6.i686.rpm: thruk-2.12-3.i686
Marking thruk-2.12-3.rhel6.i686.rpm as an update to thruk-1.80-3.x86_64
Examining thruk-base-2.12-3.rhel6.i686.rpm: thruk-base-2.12-3.i686
Marking thruk-base-2.12-3.rhel6.i686.rpm to be installed
Examining thruk-plugin-reporting-2.12-3.rhel6.i686.rpm: thruk-plugin-reporting-2.12-3.i686
Marking thruk-plugin-reporting-2.12-3.rhel6.i686.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package libthruk.i686 0:2.10-1.el6 set to be updated
---> Package thruk.i686 0:2.12-3 set to be updated
---> Package thruk-base.i686 0:2.12-3 set to be updated
--> Processing Dependency: cronie for package: thruk-base
---> Package thruk-plugin-reporting.i686 0:2.12-3 set to be updated
--> Finished Dependency Resolution
thruk-base-2.12-3.i686 from /thruk-base-2.12-3.rhel6.i686 has depsolving problems
--> Missing Dependency: cronie is needed by package thruk-base-2.12-3.i686 (/thruk-base-2.12-3.rhel6.i686)
Error: Missing Dependency: cronie is needed by package thruk-base-2.12-3.i686 (/thruk-base-2.12-3.rhel6.i686)
You could try using --skip-broken to work around the problem
You could try running: package-cleanup --problems
package-cleanup --dupes
rpm -Va --nofiles --nodigest
So I need the cronie rpm. Today, everywhere I've looked for a CentOS version is giving me nothing.
I've downloaded the source package, but even if I build and install that, I'm not convinced it would resolve the dependency above. I could use --skip-broken, but given that the original problem is related to "cron" type jobs, I'm not sure I should go that way.
The system already has vixie-cron installed:
Code: Select all
# rpm -qa|grep cron
vixie-cron-4.1-81.el5
anacron-2.3-45.el5.centos
crontabs-1.10-11.el5
So, should I build the source package and proceed?
Or do I need to query this with the thruk team ?
or what?
Any suggestions ??
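One detail in the output above may be worth checking before building anything: the downloaded files are tagged rhel6 and i686, while the installed cron packages are tagged el5 and yum reports the installed Thruk as x86_64. The sketch below only pulls the dist tag and architecture out of rpm file names with plain string handling; the function names are illustrative, and whether the mismatch explains the cronie dependency is an assumption to verify, not a confirmed diagnosis.

```shell
#!/bin/sh
# Split an rpm file name of the form name-version-release.dist.arch.rpm
# into its dist tag and architecture (string handling only, no rpm needed).
rpm_file_arch() {
    base="${1%.rpm}"
    echo "${base##*.}"
}

rpm_file_dist() {
    base="${1%.rpm}"
    base="${base%.*}"      # drop the architecture suffix
    echo "${base##*.}"
}

# Two of the file names listed in the post
for f in libthruk-2.10-1.rhel6.i686.rpm thruk-base-2.12-3.rhel6.i686.rpm; do
    printf '%s: dist=%s arch=%s\n' "$f" "$(rpm_file_dist "$f")" "$(rpm_file_arch "$f")"
done

# Architecture of the machine running the script, for comparison
echo "host arch: $(uname -m)"
```

If the host turns out to be an EL5 x86_64 system, packages built for that release and architecture would be the ones to look for first.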
-
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
Re: nagios dies - sometimes
MalcolmPreen wrote: Or do I need to query this with the thruk team ?
This, or open a new thread. I doubt community members are going to see "nagios dies" in the subject and think, "I can help this guy with thruk."
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Re: nagios dies - sometimes
Thanks ... proceeding down that line.... (contacting thruk direct).
Examining their website suggested updating using the ConSol labs repository...
Having ironed out a couple of local network routing issues, I've added the "stable" repository, and have upgraded thruk from 1.80 to 2.00
I'll try running with that.... and if there is a reproduction of the issue I will raise a ticket with them (with the option to upgrade to the "testing" repository.)
Thanks again for your input...