I have four Nagios XI servers checking their local equipment (~180 service checks for each server). I've set up inbound/outbound checks between the four of them to send passive check information to each other. Ideally, each site should be able to see all sites' equipment. I configure hosts/services as they come in under the unconfigured objects section. After a while, I stop getting unconfigured objects, and I am getting incomplete sets of passive results on each of the servers.
Here's what I'm seeing (I don't know whether these are related):
/usr/local/nagios/var/nagios.log - Warning: Check result queue contained results for service <service name> on host <host name>, but the service could not be found! Perhaps you forgot to define the service in your config files?
/usr/local/nagios/var/nagios.log - Warning: Could not stat() check result file '/usr/local/nagios/var/spool/checkresults/<check result file>'.
/usr/local/nagios/var/nagios.log - Warning: The check of service <service name> on host <host name> looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
Also, on one of the servers, /usr/local/nagios/var/spool/checkresults is accumulating check result files to the point that it fills up the disk partition.
Are there any troubleshooting steps I can take, error logs I can look at that can lead to finding out why not all checks are coming in?
Thanks.
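One way to see which of the three warnings above dominates is to count them in the log. This is only a minimal sketch: the function name count_warnings is made up for illustration, and the search strings are fragments copied from the warnings quoted in this post.

```shell
#!/bin/sh
# Count how often each of the three warnings quoted above appears in a Nagios log.
# count_warnings is a hypothetical helper name, not part of Nagios.
count_warnings() {
    log="$1"
    for pattern in \
        "but the service could not be found" \
        "Could not stat() check result file" \
        "looks like it was orphaned"
    do
        # -c prints the number of matching lines, -F matches the text literally.
        # grep exits non-zero on zero matches, so "|| true" keeps strict shells happy.
        count=$(grep -c -F -- "$pattern" "$log" || true)
        printf '%s\t%s\n' "$count" "$pattern"
    done
}
```

Running `count_warnings /usr/local/nagios/var/nagios.log` on each server prints one tab-separated count per warning, which makes it easier to compare the four systems.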
Inbound Outbound Passive Check Troubleshooting
-
sreinhardt
- -fno-stack-protector
- Posts: 4366
- Joined: Mon Nov 19, 2012 12:10 pm
Re: Inbound Outbound Passive Check Troubleshooting
I'm thinking you are likely repeatedly forwarding checks between systems. I don't believe the outbound sending from an XI system will filter received passive results by default. Reason I mention this is I think each time a system receives a check, it also forwards that same check on to all the other systems it is configured to. I could be forgetting about an option here too, I don't have a system up in front of me.
Regardless, if you can spare it, I would suggest turning off forwarding to all but one system to centralize it, for the moment. Then we need to work on the check results folder. You will lose the current results in there, but this is generally a fairly small window of time. Please do the following:
Code:
cd /usr/local/nagios/var/spool/xidpe/
find . -type f -delete
Once that's completed, restart the npcd service.
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
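Before and after clearing the spool, it can help to know how many files are actually sitting there. The sketch below wraps the same find command in two small functions; spool_count and purge_spool are hypothetical names, and the `service npcd restart` line in the comments is an assumption about how the service is managed on your system.

```shell
#!/bin/sh
# Print the number of regular files under a spool directory.
spool_count() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Delete every regular file under a spool directory (same find command as in the post).
purge_spool() {
    find "$1" -type f -delete
}

# Example sequence (paths from this thread; the restart command is an assumption):
#   spool_count /usr/local/nagios/var/spool/xidpe/
#   purge_spool /usr/local/nagios/var/spool/xidpe/
#   service npcd restart
```

Comparing the count right after the purge with the count a few hours later gives a rough idea of how quickly the backlog rebuilds.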
Re: Inbound Outbound Passive Check Troubleshooting
I tried what you suggested. I have server-4 sending its check results to server-3, server-3 sending its check results to server-2, and server-2 sending its check results to server-1. Server-1 can see passive check results from server-2, server-3, and server-4 as well as its own active checks. Things had been running fine, but this afternoon the last check times were from this morning. Here's what I'm seeing:
1) The last reported CPU utilization for server-1 was at 100%.
2) /usr/local/nagios/var/nagios.log contains numerous: Warning: the check of service <service name> on host <hostname> looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
3) /usr/local/nagios/var/spool/checkresults is full to the point it locks the terminal window.
I tried rm -f * to remove the check files, but it didn't work, so I tried the command you posted above (find . -type f -delete). It is currently deleting the files as I'm typing this. Could this be a CPU issue? I have my virtual machine set with one CPU and 2 GB of RAM. I thought I saw a chart somewhere that recommended a certain number of CPUs for a given number of service checks. Do passive checks count toward this?
Are there any other troubleshooting steps, logs I should look at, etc. that I can use to get to the bottom of this?
Thanks
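The post does not say which error rm -f * produced. One common cause in a directory holding a very large number of files is that the shell expands * into more arguments than the kernel accepts, while find never builds that list. The sketch below assumes that is what happened; arg_limit and remove_in_batches are made-up names.

```shell
#!/bin/sh
# Print the system limit (in bytes) on the combined size of arguments and environment.
arg_limit() {
    getconf ARG_MAX
}

# Remove all regular files under a directory without expanding a glob in the shell.
# With "-exec ... {} +", find hands the names to rm in batches that fit under the limit.
remove_in_batches() {
    find "$1" -type f -exec rm -f {} +
}
```

`remove_in_batches /usr/local/nagios/var/spool/checkresults` does the same job as find . -type f -delete, but uses only POSIX find options.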
-
sreinhardt
Re: Inbound Outbound Passive Check Troubleshooting
You're probably fine as far as hardware goes for the moment; 2 GB should be fine for a system of your size. I think your largest issue is going to be that xidpe directory, because when it can't be stat()ed, it makes a whole mess for other parts. Once that's cleaned up, see how your system reacts and let us know! I really wouldn't do too much else right off the bat; let's take it one step at a time.
Re: Inbound Outbound Passive Check Troubleshooting
I forgot to mention it in my previous post. I did run the "cd /usr/local/nagios/var/spool/xidpe/" and "find . -type f -delete" commands on all four servers. Things ran fine for a couple of days and then I saw what I had stated earlier.
Are there any logs I can look at to get more information on what's going on?
Are there any .cfg files I can look at? With the exception of hosts and service .cfg files, everything else is default.
Thanks.
Re: Inbound Outbound Passive Check Troubleshooting
"I have server-4 sending its check results to server-3, server-3 sending its check results to server-2, and server 2 sending its check results to server-1."
Typically, with inbound/outbound checks, you would have a totally different setup. You would use one of your XI servers as a "central" server and forward the checks from the other XI boxes directly to it.
For example, if you made server-1 central, you would need to have:
server-2 sending its checks to server-1
server-3 sending its checks to server-1
server-4 sending its checks to server-1
...not 4 -> 3 -> 2 -> 1...
Try this setup and see if this is going to work for you.
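As a quick way to write down the intended layout before changing the outbound settings on each box, the sketch below prints the forwarding pairs for a central-server setup. The function name hub_and_spoke is made up for illustration; it only prints text and does not configure Nagios XI.

```shell
#!/bin/sh
# Print one "sender -> central" line per non-central server.
# First argument: the central server; remaining arguments: the servers that forward to it.
hub_and_spoke() {
    central="$1"
    shift
    for sender in "$@"; do
        printf '%s -> %s\n' "$sender" "$central"
    done
}
```

`hub_and_spoke server-1 server-2 server-3 server-4` prints the three direct links listed above, with every sender pointing at server-1 and no server relaying another server's results.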
Be sure to check out our Knowledgebase for helpful articles and solutions!