Inbound Outbound Passive Check Troubleshooting

Post by **toodaly** » Tue Oct 14, 2014 4:34 pm

I have four Nagios XI servers, each checking its local equipment (~180 service checks per server). I've set up inbound/outbound checks between the four of them so they send passive check results to each other. Ideally, each site should be able to see every site's equipment. I configure hosts and services as they come in under the Unconfigured Objects section. After a while, I stop getting unconfigured objects, and each server ends up with an incomplete set of passive results.

Here's what I'm seeing (I don't know if these are related):
/usr/local/nagios/var/nagios.log - Warning: Check result queue contained results for service <service name> on host <host name>, but the service could not be found! Perhaps you forgot to define the service in your config files?

/usr/local/nagios/var/nagios.log - Warning: Could not stat() check result file '/usr/local/nagios/var/spool/checkresults/<check result file>'.

/usr/local/nagios/var/nagios.log - Warning: The check of service <service name> on host <host name> looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...

Also, on one of the servers, /usr/local/nagios/var/spool/checkresults is filling up with check result files to the point that it fills the disk partition.

Are there any troubleshooting steps I can take, or error logs I can look at, to find out why not all checks are coming in?

Thanks.
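For the log question above, here is a minimal sketch (not an official Nagios tool) that tallies the three warnings quoted in this post and counts the files waiting in the spool directory. The function names are made up for illustration; the default paths are the ones given in the post, and either can be overridden with an argument.

```shell
#!/bin/sh
# Hypothetical helper: count how often each warning quoted in the post
# appears in a Nagios log. Default path is the one from the post.
count_nagios_warnings() {
    log="${1:-/usr/local/nagios/var/nagios.log}"
    for pattern in \
        'but the service could not be found' \
        'Could not stat() check result file' \
        'looks like it was orphaned'
    do
        # -c prints the number of matching lines, -F matches a fixed string
        printf '%6d  %s\n' "$(grep -cF -- "$pattern" "$log")" "$pattern"
    done
}

# Hypothetical helper: count regular files in a spool directory.
# find walks the directory itself, so no shell glob is expanded.
count_spool_files() {
    dir="${1:-/usr/local/nagios/var/spool/checkresults}"
    find "$dir" -type f | wc -l
}
```

Running `count_nagios_warnings` and `count_spool_files` on each of the four servers at intervals would show which server's backlog is growing and which warning dominates there.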

Post by **sreinhardt** » Wed Oct 15, 2014 6:05 pm

I think you are likely forwarding checks repeatedly between systems. I don't believe outbound sending from an XI system filters out received passive results by default. I mention this because I think each time a system receives a check, it also forwards that same check to every other system it is configured to send to. I could be forgetting about an option here; I don't have a system in front of me.

Regardless, if you can spare it, I would suggest turning off forwarding to all but one system for the moment, to centralize it. Then we need to work on the check results folder. You will lose the current results in there, but this is generally a fairly small window of time. Please do the following:

cd /usr/local/nagios/var/spool/xidpe/
find . -type f -delete

Once that's completed, restart the npcd service.

Post by **toodaly** » Tue Oct 21, 2014 5:11 pm

I tried what you suggested. I have server-4 sending its check results to server-3, server-3 sending its check results to server-2, and server-2 sending its check results to server-1. Server-1 can see passive check results from server-2, server-3, and server-4 as well as its own active checks. Things had been running fine, but this afternoon I noticed the last check times were from this morning. Here's what I'm seeing:
1) The last reported CPU utilization for server-1 was at 100%.
2) /usr/local/nagios/var/nagios.log contains numerous: Warning: the check of service <service name> on host <hostname> looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
3) /usr/local/nagios/var/spool/checkresults is full to the point it locks the terminal window.

I tried rm -f * to remove the check result files, but it didn't work, so I tried the command you gave above (find . -type f -delete). It is deleting the files as I type this. Could this be a CPU issue? My virtual machine has one CPU and 2 GB of RAM. I thought I saw a chart somewhere that recommended a certain number of CPUs for a given number of service checks. Do passive checks count toward that?

Are there any other troubleshooting steps, logs I should look at, etc. that I can use to get to the bottom of this?

Thanks
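On `rm -f *` failing where `find . -type f -delete` worked: the post does not quote the error, but a common cause in a very full directory is that the shell expands `*` into more arguments than the kernel accepts ("Argument list too long"), whereas `find` walks the directory itself. The sketch below shows the `find` approach in a scratch directory; `purge_files` is a hypothetical helper name and the demo files are made up, not a real spool.

```shell
#!/bin/sh
# Hypothetical helper: delete every regular file under a directory
# without expanding a shell glob, so the number of files is never
# limited by the kernel's argument-length limit (see: getconf ARG_MAX).
purge_files() {
    dir="$1"
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    find "$dir" -type f -delete
}

# Demonstration in a scratch directory (made-up files, not a real spool).
demo=$(mktemp -d)
i=0
while [ "$i" -lt 500 ]; do
    : > "$demo/c$i"          # create an empty file
    i=$((i + 1))
done
echo "before: $(find "$demo" -type f | wc -l) files"
purge_files "$demo"
echo "after:  $(find "$demo" -type f | wc -l) files"
rmdir "$demo"
```

Passing the directory as an argument, rather than running `cd` first, avoids deleting files from the wrong place if the `cd` fails.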

Post by **sreinhardt** » Wed Oct 22, 2014 3:50 pm

You're probably fine as far as hardware goes for the moment; 2 GB should be enough for a system of your size. I think your biggest issue is going to be that xidpe directory: when it can't be stat()ed, it makes a mess for other parts of the system. Once that's cleaned up, see how your system reacts and let us know! I really wouldn't do much else right off the bat; let's take it one step at a time.

Post by **toodaly** » Thu Oct 23, 2014 10:38 am

I forgot to mention this in my previous post: I did run the "cd /usr/local/nagios/var/spool/xidpe/" and "find . -type f -delete" commands on all four servers. Things ran fine for a couple of days, and then I saw what I described earlier.

Are there any logs I can look at to get more information on what's going on?

Are there any .cfg files I can look at? With the exception of the host and service .cfg files, everything else is default.

Thanks.

Post by **lmiltchev** » Fri Oct 24, 2014 12:17 pm

toodaly wrote: "I have server-4 sending its check results to server-3, server-3 sending its check results to server-2, and server-2 sending its check results to server-1."

Typically, with inbound/outbound checks, you would have a totally different setup. You would use one of your XI servers as a "central" server and forward the checks from the other XI boxes to it.
For example, if you made server-1 central, you would need to have:
server-2 sending its checks to server-1
server-3 sending its checks to server-1
server-4 sending its checks to server-1
...not 4 -> 3 -> 2 -> 1...
Try this setup and see if it works for you.
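To make the difference between the two layouts concrete, here is a small sketch that reads "source destination" pairs and lists every forwarding link that does not go straight to the central server. `check_forwarding` is a hypothetical helper written for this illustration; it does not read any Nagios XI configuration, and the pairs are typed in by hand from the thread.

```shell
#!/bin/sh
# Hypothetical helper: read "source destination" pairs on stdin and print
# every link whose destination is not the central server.
# Exit status is 0 when all links point at the central server, 1 otherwise.
check_forwarding() {
    central="$1"
    awk -v central="$central" '
        NF == 2 && $2 != central { print $1 " -> " $2; bad = 1 }
        END { exit bad + 0 }
    '
}

# The chained setup described in the thread: two links bypass server-1.
printf '%s\n' 'server-4 server-3' 'server-3 server-2' 'server-2 server-1' |
    check_forwarding server-1 || echo "chained: the links above bypass server-1"

# The suggested setup: every server forwards directly to server-1.
printf '%s\n' 'server-2 server-1' 'server-3 server-1' 'server-4 server-1' |
    check_forwarding server-1 && echo "central: all links go to server-1"
```

In the chained layout, server-3 and server-2 re-forward results they received from others, which matches the repeated-forwarding concern raised earlier in the thread.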

Nagios Support Forum
