We have a central server which monitors itself and collects NSCA reports for all of the hosts and services monitored by off-site tablets.
All of the systems are running CentOS - and since last week, they are all running the latest Nagios, 4.2.1.
The problem (the nagios process on the central server failing) has been happening occasionally (once or twice a month), and from what I can see it almost always fails at around 00:50. It has been happening since we first migrated onto Nagios 4 (and possibly even on Nagios 3 - but that was quite a time ago).
I become aware of the problem because the dashboard data for all of the hosts and services is out of date. Interestingly, if I refresh my browser it still manages to get information from Nagios (even though Nagios appears not to be running) - perhaps something is cached or otherwise available in memory?
Additionally, there are many nsca processes running, and /var/log/messages starts to log messages such as:
date/time host-name xinetd[3613]: FAIL: nsca service_limit from=source
So far I have found nothing in /var/log/messages or /usr/local/nagios/var/nagios.log which indicates a failure, except for a lack of messages.
To restore the system to normal, I can "just" restart Nagios as usual, although I also have to kill -1 the old nsca processes (until I do that, no new nsca processes can run and no more data is received).
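To check whether the xinetd rejections really cluster around 00:50, something like the following rough sketch could tally them per minute. It assumes the traditional syslog timestamp layout ("Mon DD HH:MM:SS host ...") in /var/log/messages, which may not match every rsyslog configuration, and the function name is purely illustrative.

```python
import re
from collections import Counter

# Assumption: traditional syslog layout, e.g.
# "Sep 12 00:50:01 host xinetd[3613]: FAIL: nsca service_limit from=addr"
LINE_RE = re.compile(
    r"^(?P<mon>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):(?P<minute>\d{2}):\d{2}\s+"
    r"\S+\s+xinetd\[\d+\]:\s+FAIL:\s+nsca\s+service_limit\b"
)

def count_service_limit_failures(lines):
    """Count nsca service_limit rejections, keyed by 'Mon DD HH:MM'.

    Keys appear in the order they are first seen, i.e. in log order.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            key = f"{m['mon']} {int(m['day']):2d} {m['hour']}:{m['minute']}"
            counts[key] += 1
    return counts
```

It could be fed the log directly, for example `count_service_limit_failures(open("/var/log/messages", errors="replace"))`, and the first key in the result would show roughly when nsca connections started being refused.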
So, is this a known problem?
Assuming it isn't, what can I do to provide diagnostics?
I have the current logs, but I can turn on additional debugging "for next time" if that would help. I have not attached any logs yet, as they are all huge.
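As one idea for "next time", a small snapshot script run from cron around the failure window could record how many nagios and nsca processes exist at each moment. This is only a sketch: it assumes Linux's /proc layout and that the processes report the command names "nagios" and "nsca", and the function name is made up for illustration.

```python
import os
from collections import Counter

def count_processes(names, proc_root="/proc"):
    """Count running processes whose command name is in `names` (Linux /proc only)."""
    wanted = set(names)
    counts = Counter({name: 0 for name in wanted})
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories correspond to processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                comm = fh.read().strip()
        except OSError:
            continue  # process exited, or entry is unreadable
        if comm in wanted:
            counts[comm] += 1
    return counts
```

Logging the result of `count_processes(["nagios", "nsca"])` with a timestamp every minute between, say, 00:40 and 01:00 might show whether nagios disappears first or the nsca processes pile up first.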
Any advice greatly appreciated.
Thanks, Malcolm