Weird and big issue with server

Post by **BanditBBS** » Sun Mar 22, 2015 1:10 am

Check this out (localhost # of processes started growing midday Wednesday):

localhost-total_processes.jpg

That correlates exactly with when I updated the plugins on about 850 of our servers. We use check_by_ssh to run the CPU/disk/memory/process/etc. checks, and those are the ones not ending. When it hits the peaks I get this (ndo2db is offloaded; ignore the third one):

system status.JPG

When this happens I can look at top and it looks like Nagios is still running as I see checks running and perfdata is being logged, but XI just stops seeing any update information until I kill all the ssh connections with

Code: Select all

kill -9 `ps -ef | grep /usr/bin/ssh | grep -v grep | awk '{print $2}'`

and restart the nagios process.

2 things...

1.) Any idea why updating plugins would make check_by_ssh stop closing out processes cleanly?
2.) Any idea why XI is behaving like it is when the process count gets that high? The server isn't under any stress or anything. Is it because I have max concurrent checks set to 4000?

Post by **BanditBBS** » Sun Mar 22, 2015 9:25 am

I found the issue. It is the check_disk plugin. On a few servers the check can't finish, for whatever reason. I have a timeout of 240 set on the nested check_disk and a timeout of 270 on check_by_ssh. If something causes the check to hang, the check_disk timeout isn't working and check_by_ssh times out instead. check_by_ssh closes but the child process does not, and that is what is shooting my process count up.

Besides fixing the disk issue as the obvious solution, any idea what I can do to resolve this issue as I described?

Post by **Box293** » Sun Mar 22, 2015 9:35 pm

What have you defined the nagios core service_check_timeout as?

Code: Select all

grep service_check_timeout /usr/local/nagios/etc/nagios.cfg

Post by **BanditBBS** » Sun Mar 22, 2015 9:38 pm

640. I know that without having to look.
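For reference, the three timeouts quoted so far form a nested chain: check_disk (240) inside check_by_ssh (270) inside the core service_check_timeout (640). A minimal sketch that checks this ordering; the function name `timeouts_nested` and the variable names are hypothetical, only the numbers come from the thread.

```shell
#!/bin/sh
# Values quoted in the thread, in seconds (variable names are illustrative).
CHECK_DISK_TIMEOUT=240      # check_disk -t, innermost, runs on the remote host
CHECK_BY_SSH_TIMEOUT=270    # check_by_ssh -t, wraps the ssh session
SERVICE_CHECK_TIMEOUT=640   # service_check_timeout in nagios.cfg, outermost

# Succeeds only if each layer gives up before the layer that wraps it.
timeouts_nested() {
    [ "$1" -lt "$2" ] && [ "$2" -lt "$3" ]
}

if timeouts_nested "$CHECK_DISK_TIMEOUT" "$CHECK_BY_SSH_TIMEOUT" "$SERVICE_CHECK_TIMEOUT"; then
    echo "OK: timeouts are nested"
else
    echo "WARNING: an outer timeout fires before an inner one"
fi
```

The ordering here is already sensible, which suggests the leftover processes come from the inner timeout never taking effect rather than from a misconfigured chain.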

Post by **Box293** » Sun Mar 22, 2015 10:02 pm

How about using the linux timeout command to kill the process on the remote box after x time?

Code: Select all

timeout 10 ping www.goooooogle.com
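As a hedged sketch of how this could be applied to a plugin: GNU coreutils `timeout` exits with status 124 when the limit is reached, and a small wrapper can map that to a Nagios UNKNOWN (exit code 3). The wrapper name `run_with_limit` and the message text are made up for illustration.

```shell
#!/bin/sh
# Run a command under GNU `timeout` and translate a timeout into Nagios UNKNOWN.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # 124 is what GNU timeout returns when the time limit was hit
        echo "UNKNOWN - command exceeded ${limit}s"
        return 3
    fi
    # Otherwise pass the command's own exit status through unchanged
    return "$status"
}
```

On the remote box this might be called as `run_with_limit 240 /usr/local/nagios/libexec/check_disk -w 20% -c 10%`. Note that `timeout` only sends a signal; a process that cannot receive signals at that moment will not go away just because the wrapper returned.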

Post by **BanditBBS** » Mon Mar 23, 2015 9:32 am

Box293 wrote: How about using the linux timeout command to kill the process on the remote box after x time?
Code: Select all
timeout 10 ping www.goooooogle.com

Hmm, I would have loved to try that, but it seems none of my Linux servers know the timeout command, lol.
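On hosts where `timeout` is missing (older coreutils releases do not ship it), a similar effect can be approximated in plain shell with a background watchdog. This is only a sketch under that assumption; `run_with_deadline` is a hypothetical name, and like `timeout` it cannot remove a process that is unable to receive signals.

```shell
#!/bin/sh
# Plain-shell stand-in for `timeout`.
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
run_with_deadline() {
    limit=$1
    shift
    "$@" &                      # start the real command in the background
    cmd_pid=$!
    # Watchdog: after $limit seconds, send SIGTERM to the command.
    ( sleep "$limit"; kill -TERM "$cmd_pid" ) >/dev/null 2>&1 &
    watchdog_pid=$!
    wait "$cmd_pid"             # returns 143 (128 + 15) if SIGTERM killed it
    status=$?
    # Command finished (or was killed): stop the watchdog so it never fires late.
    kill "$watchdog_pid" 2>/dev/null
    return "$status"
}
```

One trade-off of this simple form: the watchdog's `sleep` may linger until its timer runs out, so it leaves a short-lived extra process per call.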

Also - http://unix.stackexchange.com/questions ... s-not-work
Read that, especially this:

A classical case of long uninterruptible sleep is processes accessing files over NFS when the server is not responding; modern implementations tend not to impose uninterruptible sleep (e.g. under Linux, the intr mount option allows a signal to interrupt NFS file accesses).

That is what's making it hang.
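Processes stuck in uninterruptible sleep show a state starting with `D` in `ps` output. A minimal sketch for spotting them on an affected host; the filter name `filter_dstate` is hypothetical, and it assumes the state is the second column, as in `ps -eo pid,stat,comm`.

```shell
#!/bin/sh
# Keep the header line plus every process whose state starts with "D"
# (uninterruptible sleep). Reads `ps -eo pid,stat,comm` output on stdin.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical use on the remote server:
#   ps -eo pid,stat,comm | filter_dstate
```

If the hung check_disk or df processes appear in this list, signals (including the one sent by kill -9) will not take effect until the blocked I/O returns.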

Post by **cmerchant** » Mon Mar 23, 2015 3:24 pm

Bandit, is your check_by_ssh check_disk checking NFS mounts as well? If this is hanging the ssh checks, you could add the -X nfs option to exclude NFS-mounted filesystems.

Example:

Code: Select all

/usr/local/nagios/libexec/check_by_ssh -H hostname -C "/usr/local/nagios/libexec/check_disk -w 20% -c 10% -A -i .gvfs -X nfs"

What was the version of the plugins prior to the upgrade?

Post by **BanditBBS** » Mon Mar 23, 2015 3:29 pm

Yes, we are checking NFS. We were not before the upgrade; I needed to update the plugins to take advantage of other options (they were quite old!). Sorry if I sounded like the upgrade was what made these start to hang; it is definitely the fact that it is checking NFS mounts. We would like to keep checking the NFS mounts for space, but with this issue I'm not sure it is a possibility.

Post by **cmerchant** » Mon Mar 23, 2015 3:53 pm

Is it the same NFS server for all of the remote servers? If so, isn't checking the space from every client kind of redundant, and could you check the space on the NFS server directly?

I suppose you want to know that the NFS mounts are currently mounted and active. Have you pushed out the new plugins to all of the servers?
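If the goal is only to confirm which NFS filesystems are mounted, one hedged option is to read the kernel's mount table rather than calling df, since that reads a list and does not query the NFS server. A sketch, assuming a Linux host with `/proc/mounts`; the function name `nfs_mounts` is hypothetical.

```shell
#!/bin/sh
# Print the mount point of every nfs/nfs4 entry in a mounts table.
# $1: path to the table, defaults to /proc/mounts
nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' "${1:-/proc/mounts}"
}

# Typical use on the remote server:
#   nfs_mounts
```

This only shows that a mount exists, not that the server behind it is responding, so it complements a space check rather than replacing it.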

Post by **BanditBBS** » Mon Mar 23, 2015 4:26 pm

cmerchant wrote: Is it the same NFS server for all of the remote servers? Is this kinda redundant on checking the space, and could you check the NFS server space direct?

I suppose you want to know that the nfs mounts are currently mounted and active. Have you pushed out the new plugins to all of the servers?

Yeah, I pushed it out to all servers. It's not the plugin's fault; I'm just hoping to find a way to work around the issue, which is why I'm posting here. When the plugin hangs like this (causing high load on the remote server), even going to the CLI of the remote server and typing df results in a hung process. I fear there isn't a way around this and it's just how it's going to work with NFS when there is an issue.
