Monitoring Engine service randomly stops, no active checks performed, status.dat occasionally disappears
Posted: Fri Apr 05, 2024 10:59 am
Hi,
We've been running Nagios XI 5.8.7 internally for a few years for various server monitoring. It's hosted on a Hyper-V VM running CentOS 7, which was heavily based on the trial VM originally provided. We also have our own status and alert site, which is populated from status.dat. This has been running without issue for some time.
A few weeks ago we started seeing service checks that were quite dated and not within the usual 5 minutes. Digging into this, we found the monitoring engine service had stopped. We restarted it thinking it was a one-off, but it has continued to happen since. While investigating we set up a cron job to restart the service every 30 minutes, as it would normally run for at least that long without issue, though it is now failing more frequently. A new symptom is that status.dat sometimes disappears.
We've managed to regenerate status.dat by applying the config again, which works 9 times out of 10, though we are still unsure what is causing the monitoring service to stop.
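To pin down when status.dat goes missing or stops updating, a small check along these lines could be run from cron and its output appended to a log. This is only a rough sketch: the function name `status_file_state` is made up for illustration, the default path is assumed to be the usual Nagios XI location and may differ on other installs, and the 300-second threshold is simply our normal 5-minute check window.

```python
import os
import sys
import time


def status_file_state(path, max_age_seconds=300, now=None):
    """Return 'missing', 'stale', or 'fresh' for the given status file.

    Hypothetical helper: 'stale' means the file has not been modified
    within max_age_seconds, which suggests the engine stopped writing it.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return "missing"
    return "stale" if now - mtime > max_age_seconds else "fresh"


if __name__ == "__main__":
    # Assumed default location; pass the real path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/usr/local/nagios/var/status.dat"
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"{stamp} {path}: {status_file_state(path)}")
```

Running this every minute and comparing the timestamps of the first "stale" or "missing" line against the system logs might show what else was happening when the engine stopped.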
The only thing that changed around the time the issue started was that a few additional VMs were added for monitoring, configured the same as our other hosts and using the NCPA agent. Thinking this might be the cause, we removed those hosts and service checks, but saw no change.
If anyone has any advice we would be grateful!