
Nagios slave keeps exiting/crashing

Posted: Mon Jul 30, 2012 1:33 am
by nocturnal
I have a nagios slave server in a remote network that consists of a bare-bones, up-to-date Debian squeeze with nagios core 3.2.1. It had worked fine for years before I started working here. About two weeks ago I noticed that the original sysadmin had set up a watchdog script that periodically checks whether the nagios daemon is running and restarts it if it is not.

I noticed this because the nagios process suddenly started exiting, and the watchdog script e-mails operations each time it has to start the daemon.

This has continued, with at least one unscheduled restart of nagios each day. I've activated debug logging, but the log fills up so quickly that I only have one sample, from this morning, when the cron e-mail arrived at 08:10 to tell me the nagios server process was down and had been started by the script.

Code:

[Mon Jul 30 08:09:44 2012.951684] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951799] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951817] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951832] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951848] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951863] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:10:02 2012.190888] [001.0] [pid=25378] drop_privileges() start
[Mon Jul 30 08:10:02 2012.285905] [001.0] [pid=25379] xrddefault_read_state_information() start
[Mon Jul 30 08:10:03 2012.814826] [001.0] [pid=25379] init_timing_loop() start
[Mon Jul 30 08:10:03 2012.814894] [001.0] [pid=25379] check_time_against_period()
[Mon Jul 30 08:10:03 2012.814910] [001.0] [pid=25379] check_time_against_period()
Nothing remarkable. It's the same situation in the normal log: sometimes nagios does nothing for a period of time and then I just see it start back up. Sometimes it has a little load, and then suddenly I see it start back up around the time of the cron e-mail.
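To find the restart points in a large debug log without reading it by eye, something like the following could help. It is only a rough sketch: it assumes every line has the same layout as the excerpt above (timestamp, level, pid, message), treats the digits after the year as microseconds, and the function names are made up for illustration. It reports each place where the pid changes, together with the time gap between the two lines.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: [Mon Jul 30 08:09:44 2012.951684] [001.0] [pid=25295] message
LINE_RE = re.compile(
    r"\[(\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\.(\d+)\]"
    r"\s+\[[^\]]*\]\s+\[pid=(\d+)\]\s+(.*)"
)


def parse_line(line):
    """Return (timestamp, pid, message) or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # %a and %b are locale-dependent; this assumes English day and month names.
    stamp = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
    # Assumption: the fractional part is microseconds.
    micros = int(match.group(2)[:6].ljust(6, "0"))
    return stamp.replace(microsecond=micros), int(match.group(3)), match.group(4)


def pid_changes(lines):
    """Return (last record of old pid, first record of new pid) for every pid change."""
    changes = []
    previous = None
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        if previous is not None and record[1] != previous[1]:
            changes.append((previous, record))
        previous = record
    return changes


def report(lines):
    """Print one line per pid change with the gap in seconds."""
    for old, new in pid_changes(lines):
        gap = (new[0] - old[0]).total_seconds()
        print(f"pid {old[1]} -> {new[1]} after {gap:.3f}s; last message: {old[2]}")
```

Calling `report(open(path))` on the debug file prints the last message logged by each pid before it disappeared, which is the line worth looking at. On the excerpt above it shows a gap of roughly 17 seconds between the last line of pid 25295 and the first line of pid 25378.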

The failures are very random. Yesterday, for example, there was one at 00:10 and one at 00:45. Before that they mostly happened once a day, at completely random times.

There are never any error messages, and I can't find any core dumps.
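One thing that can explain missing core dumps is a core file size limit of 0 for the daemon. On Linux the limits of a running process are listed in /proc/<pid>/limits, so a small check along these lines could confirm it. This is only a sketch, the function names are hypothetical, and the pid would have to be that of the running nagios daemon.

```python
def core_limit_from_proc_limits(text):
    """Return (soft, hard) core file size limits from /proc/<pid>/limits text, or None."""
    label = "Max core file size"
    for line in text.splitlines():
        if line.startswith(label):
            fields = line[len(label):].split()
            return fields[0], fields[1]
    return None


def read_core_limit(pid):
    """Read the core file size limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return core_limit_from_proc_limits(handle.read())
```

A soft limit of `0` means the kernel will not write a core file for that process, whatever the reason for the exit.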

The nagios slave is polled by the master over ssh, using a dedicated user with public keys and nagios permissions. It's a very small network, with no more than 16 hosts and 82 services set up. There are some remarkable perl scripts to monitor emc and humidity, but no changes have been made to those for at least a year.

I'm out of ideas here. The only thing I can think of is that this coincided with the rollout of a new deployment system, Ansible. Because of the remote location of this slave, I have only used it there for user/password administration and basic Debian packages; there is no remote syslogging and no LDAP login yet. The nagios user can still log in with its public key and was never touched by the deployment system, since so far I have only set up the administrative users and root. Even so, that is the only major change made around the time the failures started.

Re: Nagios slave keeps exiting/crashing

Posted: Wed Aug 01, 2012 9:59 am
by nscott
Are any of the files in the nagios/var directory over 2 gigabytes?
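If it helps, a quick way to check is a short script like this one. It is just a sketch: the function name is made up, and the directory to pass in depends on where your install keeps its var files (on a packaged Debian install that may not be a literal nagios/var path).

```python
import os

TWO_GIB = 2 * 1024 ** 3  # 2 gigabytes, taken here as 2 * 1024^3 bytes


def find_large_files(root, threshold=TWO_GIB):
    """Return (path, size) pairs for files under root larger than threshold bytes."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if size > threshold:
                hits.append((path, size))
    # Largest files first.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Call it with the var directory, for example `find_large_files("/path/to/nagios/var")`, and it returns every file above the threshold. With debug logging turned on, the debug file is worth checking too.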