Monitoring Engine service randomly stops, no active checks performed, status.dat occasionally disappears

Post by **pipv** » Fri Apr 05, 2024 10:59 am

Hi,
We've been running Nagios XI 5.8.7 internally for a few years for various server monitoring. It's hosted on a Hyper-V VM running CentOS 7, which was heavily based on the trial VM originally provided. We also have our own status and alert site, which is populated from status.dat. This has been running without issue for some time.

A few weeks ago we started seeing service checks that were quite dated, well outside the usual 5 minutes. Digging into this, we found the monitoring engine service had stopped. We restarted it thinking it was a one-off, but it has continued to happen since. While investigating, we set up a cron job to restart the service every 30 minutes, as it would normally run for at least that long without issue, though it is now failing more frequently. A new symptom is that status.dat sometimes disappears.

We've managed to regenerate status.dat by applying the configuration again, which works about 9 times out of 10, though we are still unsure what is causing the monitoring service to stop.
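For anyone wanting to catch the missing or dated status.dat sooner, here is a minimal sketch of a freshness check. The path `/usr/local/nagios/var/status.dat` and the 300-second threshold are assumptions (the threshold simply mirrors the 5-minute window mentioned above), and `status_file_state` is a made-up helper name, not part of Nagios.

```python
import os
import time

# Assumed default location on a Nagios XI install; adjust if yours differs.
STATUS_PATH = "/usr/local/nagios/var/status.dat"


def status_file_state(path, max_age_seconds=300, now=None):
    """Return 'missing', 'stale', or 'fresh' for the status file at `path`."""
    now = time.time() if now is None else now
    try:
        # Last modification time of the file, in seconds since the epoch.
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return "missing"
    # Older than the threshold suggests the engine stopped updating it.
    return "stale" if (now - mtime) > max_age_seconds else "fresh"


print(status_file_state(STATUS_PATH))
```

Run from cron alongside the restart job, the printed state could be logged with a timestamp to show whether the file vanishes before or after the engine stops.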

The only thing that changed around the time the issue started was that a few additional VMs were added for monitoring, configured the same as our other hosts and using the NCPA agent. Thinking this might be the cause, we removed the hosts and service checks, but there was no change.

If anyone has any advice we would be grateful!

Post by **danderson** » Fri Apr 05, 2024 11:20 am

Thanks for reaching out @pipv,

Is it possible that you are running out of memory and the process is getting killed by the OOM killer? Is there anything in nagios logs that would indicate what's failing?
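One way to check for OOM kills is to search the kernel messages for "Killed process" lines. A minimal sketch, assuming the kernel log is available as text (on CentOS 7 that is typically `/var/log/messages`); the helper names `find_oom_kills` and `scan_file` are hypothetical:

```python
import re

# Matches the kernel's "Killed process <pid> (<name>)" message.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return a (pid, process_name) tuple for each OOM-kill line found."""
    hits = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


def scan_file(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_oom_kills(handle)
```

Calling `scan_file("/var/log/messages")` as root and looking for `nagios` in the returned process names would show whether the engine was ever a victim.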

Post by **pipv** » Tue Apr 09, 2024 11:38 am

Hi @danderson, thanks for getting back to me. Our initial thought was also that it might be due to resources. We took a rather blunt approach to begin with and threw some extra resources at the VM while watching for spikes, but nothing seemed out of the ordinary.

Reviewing the nagios.log pulled from the server, I can see that before the monitoring service stops there is a service check consistently timing out after 60 seconds, error code 62. There are then entries for core worker jobs saying dormant child reaped and timed out, killing it, plus another entry saying iocache_capacity is -1048576 for a worker ID, which appears to be choked.
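To see how often these entries appear before each stop, a rough tally along these lines may help. It is only a sketch: it assumes each nagios.log line starts with a Unix timestamp in square brackets, and the search phrases are taken from the paraphrased symptoms above rather than from exact log strings, so they may need adjusting. `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Nagios log lines start with a Unix timestamp in square brackets.
LOG_LINE = re.compile(r"^\[(\d+)\]\s+(.*)$")

# Phrases based on the symptoms described above; the exact wording in
# your nagios.log may differ, so treat these as assumptions.
PATTERNS = {
    "timed_out": "timed out",
    "dormant_child": "dormant child reaped",
    "iocache_choked": "iocache_capacity",
}


def summarize(lines):
    """Count log lines whose message contains each phrase (case-insensitive)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped log line
        message = match.group(2).lower()
        for label, phrase in PATTERNS.items():
            if phrase in message:
                counts[label] += 1
    return counts
```

Feeding it the lines of the log file (for example `summarize(open(path, errors="replace"))`, with `path` pointing at your copy of nagios.log) gives a per-symptom count that can be compared between a stable period and the run-up to a stop.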

For now, I've disabled the checks in question to see if the monitoring service is stable. If so, I'll dig deeper to find the root cause.

Cheers!
