It is normal for them to be there temporarily, but if they aren't from the last several minutes, they were most likely left behind by multiple nagios processes running at some point in the past.
Check Services Being Orphaned
Some users have seen warning messages like the following accumulate quickly and in large numbers:
Warning: The check of service <Your Service> on host <Your Host> looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service..
This is most likely caused by multiple instances of Nagios running. To fix this, kill all instances of Nagios and then restart the process.
killall -9 nagios
Then restart Nagios from the Admin menu of the web interface.
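Before killing anything, it can help to confirm that more than one daemon is actually running. The sketch below is a heuristic, not an official check: it assumes the daemon's process name is nagios and that each daemon instance has PID 1 (init or systemd) as its parent, while children forked by a single daemon have that daemon as their parent. The function name count_top_level is made up for this example.

```shell
# Reads "PID PPID COMMAND" lines on stdin and prints how many processes
# named $1 have PID 1 as their parent. A header line is ignored because
# its second column is not the number 1.
count_top_level() {
    awk -v name="$1" '$3 == name && $2 == 1 { n++ } END { print n + 0 }'
}

# Typical use on a Linux host (procps ps assumed):
#   ps -eo pid,ppid,comm | count_top_level nagios
```

A result greater than 1 suggests duplicate instances. A check process that lost its parent can also show up under PID 1, so treat the number as a hint and look at the full ps listing before acting on it.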
Related forum post can be read here.
If the issue persists after reboots and restarts of the Nagios service, it is most likely caused by either a memory leak in embedded Perl or system ulimit restrictions. Symptoms can include the /tmp directory filling up quickly with check* files and the following errors in the nagios log.
[1331905537] Warning: The check of service 'SERVICE' on host 'NAMESERVER' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1331755699] Warning: The check of service 'SWAP' on host 'nameserver' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
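One way to check for the first symptom is to count the check* files that are older than a few minutes. This is a minimal sketch, assuming a find that supports -maxdepth and -mmin (GNU and BusyBox do, but they are not in POSIX). The function name and the 10-minute threshold are illustrative only.

```shell
# Print the number of regular files named check* directly under
# directory $1 that were last modified more than $2 minutes ago.
count_stale_checks() {
    find "$1" -maxdepth 1 -type f -name 'check*' -mmin "+$2" | wc -l | tr -d ' '
}

# Typical use:
#   count_stale_checks /tmp 10
```

A count that keeps growing between runs matches the symptom described above.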
Try the following solutions:
Edit /etc/security/limits.conf
* hard memlock 128 #locked memory
* soft memlock 128
* soft nofile 4096 #open files
* hard nofile 4096
* hard nproc 4096 #max user processes
* soft nproc 4096
* hard stack 20480 #stack size
* soft stack 20480
and restart the server. Run
ulimit -a
to verify that the new settings are in place.
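Rather than reading through the full ulimit -a listing, the four values can be compared one at a time. This is a sketch under assumptions: check_limit is a hypothetical helper, and the ulimit flags in the usage comments (-n, -u, -s, -l) are the bash ones, so run them in a bash session for the user that runs Nagios.

```shell
# Compare one limit against the value expected from limits.conf.
# $1 = label, $2 = actual value, $3 = expected value
check_limit() {
    if [ "$2" = "$3" ]; then
        echo "OK $1=$2"
    else
        echo "MISMATCH $1=$2 (expected $3)"
    fi
}

# Typical use in bash, with the values from limits.conf above:
#   check_limit nofile  "$(ulimit -n)" 4096
#   check_limit nproc   "$(ulimit -u)" 4096
#   check_limit stack   "$(ulimit -s)" 20480
#   check_limit memlock "$(ulimit -l)" 128
```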
Also update the settings in your nagios.cfg file to match the following:
enable_embedded_perl=0
use_embedded_perl_implicitly=0
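To confirm what the file currently contains, the active (uncommented) lines for these two directives can be printed with grep. This is a minimal sketch: show_epn_settings is a made-up helper name, and the path in the usage comment is an assumption about where nagios.cfg lives on your system.

```shell
# Print the uncommented lines in config file $1 that set either of the
# two embedded Perl directives.
show_epn_settings() {
    grep -E '^[[:space:]]*(enable_embedded_perl|use_embedded_perl_implicitly)[[:space:]]*=' "$1"
}

# Typical use (path is an assumption, adjust to your install):
#   show_epn_settings /usr/local/nagios/etc/nagios.cfg
```

Both lines should end in =0. Nagios has to be restarted before the change takes effect.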
I tried making the changes and restarting Nagios, both from the command line (service nagios stop / killall -9 nagios / service nagios start) and with a full reboot of the server, and I would still end up with hundreds of hosts and services showing critical (for instance, ping times were exceptionally long, with over 50% packet loss).
Yes, I did clear out all of the orphaned checks per the suggestion.
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ; do
This will give the nagios process plenty of time to cleanly close off any remaining checks. If you have a lot of checks that take a long time to complete, you could potentially have some temp files leftover upon a restart of the process.
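The body of the loop is not shown above. The same idea, stopping Nagios and then polling for up to 30 seconds until no nagios process is left, could be sketched as follows. This is not the original loop: wait_for_exit is a hypothetical helper, and it assumes pgrep (procps) is installed and that the process name is nagios.

```shell
# Wait up to $2 seconds for every process named exactly $1 to exit.
# Returns 0 once none remain, 1 if some are still running at the timeout.
wait_for_exit() {
    remaining=$2
    while pgrep -x "$1" >/dev/null 2>&1; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Typical use:
#   service nagios stop
#   wait_for_exit nagios 30 || echo "nagios still running after 30 seconds"
```

Only if the wait times out would a forced killall -9 nagios be needed, which keeps leftover temp files to a minimum.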
This is immediately after rebooting the server this morning.
I came in to find about 300 hosts down with packet loss on ping. From my desktop, I'm able to ping without issue. Once I reboot the system, it seems to resolve, for a little while anyway.
I am not using a distributed setup. I have checked the plugin page, and I do not see any plugins for either DNX or Gearman installed.
I took over this install from a previous admin, who did not leave any documentation or other information about how the system was set up, so I am having to figure it out as I go.