I noticed this because the nagios process suddenly started exiting, and the watchdog script e-mails operations each time it has to start the daemon.
This has continued with at least one unscheduled restart of nagios each day. I've activated debug logging, but the log fills up so quickly that I only have one reference: this morning, when the cron e-mail arrived at 08:10 to tell me the nagios server process was down and had been started by the script.
Code:
[Mon Jul 30 08:09:44 2012.951684] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951799] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951817] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951832] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951848] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:09:44 2012.951863] [001.0] [pid=25295] process_macros()
[Mon Jul 30 08:10:02 2012.190888] [001.0] [pid=25378] drop_privileges() start
[Mon Jul 30 08:10:02 2012.285905] [001.0] [pid=25379] xrddefault_read_state_information() start
[Mon Jul 30 08:10:03 2012.814826] [001.0] [pid=25379] init_timing_loop() start
[Mon Jul 30 08:10:03 2012.814894] [001.0] [pid=25379] check_time_against_period()
[Mon Jul 30 08:10:03 2012.814910] [001.0] [pid=25379] check_time_against_period()
The failures occur at random times. Yesterday, for example, there was one at 00:10 and one at 00:45. Before that, they mostly happened once a day, at unpredictable times.
There are never any error messages, and I can't find any core dumps.
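Since the debug log is too large to read by hand, the restart points can be located mechanically. Below is a minimal Python sketch that assumes the line format shown in the excerpt above; the function names, the `SAMPLE` list, and the 5-second threshold are my own illustrative choices, not anything nagios provides. It flags every place where the pid changes after a pause in logging, which in the excerpt is the jump from pid 25295 at 08:09:44 to pid 25378 at 08:10:02.

```python
import re
from datetime import datetime, timedelta

# Matches debug-log lines in the format seen in the excerpt, e.g.
# [Mon Jul 30 08:09:44 2012.951684] [001.0] [pid=25295] process_macros()
LINE_RE = re.compile(
    r"^\[(?P<stamp>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\.(?P<usec>\d{6})\] "
    r"\[[^\]]*\] \[pid=(?P<pid>\d+)\] (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, pid, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group("stamp"), "%a %b %d %H:%M:%S %Y")
    stamp += timedelta(microseconds=int(m.group("usec")))
    return stamp, int(m.group("pid")), m.group("msg")


def find_restarts(lines, min_gap_seconds=5.0):
    """Collect (last_entry_before, first_entry_after, gap_in_seconds) for
    every pid change that follows a logging pause of at least
    min_gap_seconds. The threshold is an arbitrary assumption, chosen so
    that pid changes a fraction of a second apart are not reported."""
    restarts = []
    prev = None
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip anything that is not a debug-log line
        if prev is not None:
            gap = (entry[0] - prev[0]).total_seconds()
            if entry[1] != prev[1] and gap >= min_gap_seconds:
                restarts.append((prev, entry, gap))
        prev = entry
    return restarts


# Four lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "[Mon Jul 30 08:09:44 2012.951848] [001.0] [pid=25295] process_macros()",
    "[Mon Jul 30 08:09:44 2012.951863] [001.0] [pid=25295] process_macros()",
    "[Mon Jul 30 08:10:02 2012.190888] [001.0] [pid=25378] drop_privileges() start",
    "[Mon Jul 30 08:10:02 2012.285905] [001.0] [pid=25379] xrddefault_read_state_information() start",
]

if __name__ == "__main__":
    for before, after, gap in find_restarts(SAMPLE):
        print("last entry :", before[0], "pid", before[1], before[2])
        print("first entry:", after[0], "pid", after[1], after[2])
        print("gap (s)    :", gap)
```

To run it against the real file, pass an open file object instead of `SAMPLE`. The last message logged before each gap is the interesting part, since it shows what the old process was doing when it went away.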
The nagios slave is polled by the master over ssh, using a special user with public keys and nagios permissions. It's a very small network, with no more than 16 hosts and 82 services set up. There are a few unusual Perl scripts to monitor emc and humidity, but no changes have been made to them for at least a year.
I'm out of ideas here. The only thing I can think of is that the failures began around the same time I rolled out a new deployment system, Ansible. Because of the remote location of this slave, I could only use it for user/password administration and basic Debian packages; there is no remote syslogging and no LDAP login yet. The nagios user can still log in using its public key and was never affected by the deployment system, since so far I have only set up the administrative users and root. Even so, that is the only major change made around the time the failures started.