Weird and big issue with server

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Post by BanditBBS »

Check this out (the localhost process count started growing midday Wednesday):
[Attachment: localhost-total_processes.jpg]
That correlates exactly with when I updated the plugins on about 850 of our servers. We use check_by_ssh to run the CPU/disk/memory/process/etc. checks, and those are the ones not ending. When it hits the peaks I get this (ndo2db is offloaded, so ignore the third one):
[Attachment: system status.JPG]
When this happens, top shows that Nagios is still running: I see checks running and perfdata is being logged, but XI stops seeing any update information until I kill all the ssh connections with

Code:

kill -9 $(ps -ef | grep /usr/bin/ssh | grep -v grep | awk '{print $2}')
and restart the nagios process.
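
A narrower alternative to killing every ssh process is to target only those that have outlived the check_by_ssh timeout. The following is a minimal sketch rather than a tested fix: the function name `stale_ssh_pids` and the 300-second threshold are assumptions for illustration, and it relies on the `etimes` (elapsed seconds) column, which procps-ng `ps` provides but older `ps` builds may not.

```shell
#!/bin/sh
# Hypothetical helper: read "PID ELAPSED_SECONDS COMMAND..." lines on stdin
# and print the PIDs of /usr/bin/ssh processes older than the given age.
stale_ssh_pids() {
    max_age=$1
    awk -v max="$max_age" '$3 == "/usr/bin/ssh" && $2 + 0 > max + 0 { print $1 }'
}

# Example use (commented out so the sketch has no side effects):
#   ps -eo pid=,etimes=,args= | stale_ssh_pids 300 | xargs -r kill
```

Sending the default TERM signal first, and reserving `kill -9` for processes that ignore it, gives ssh a chance to close its connection cleanly.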

2 things...

1.) Any idea why updating plugins would make check_by_ssh stop closing out processes cleanly?
2.) Any idea why XI is behaving like it is when the process count gets that high? The server isn't under any stress or anything. Is it because I have max concurrent checks set to 4000?
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github

Post by BanditBBS »

I found the issue: it is the check_disk plugin. On a few servers the check can't finish, for whatever reason. I have a timeout of 240 seconds set on the nested check_disk and a timeout of 270 seconds on check_by_ssh. If something causes the check to hang, the check_disk timeout doesn't work and check_by_ssh times out instead. check_by_ssh exits, but the child process does not, and that is what is driving my process count so high.

Besides the obvious solution of fixing the disk issue, any idea what I can do to resolve the behavior I described?
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Post by Box293 »

What is the Nagios Core service_check_timeout set to?

Code:

grep service_check_timeout /usr/local/nagios/etc/nagios.cfg
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Post by BanditBBS »

640. I know that without having to look.

Post by Box293 »

How about using the Linux timeout command to kill the process on the remote box after x time?

Code:

timeout 10 ping www.goooooogle.com
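
Applied to the check itself, the same idea would mean running the plugin under `timeout` on the remote host. The sketch below assumes GNU coreutils `timeout` is installed there; the function name `run_check_with_limit` and the 5-second grace period are hypothetical. A process stuck in uninterruptible sleep may not react even to the follow-up KILL, so this is a mitigation, not a guarantee.

```shell
#!/bin/sh
# Hypothetical wrapper: run a plugin under GNU coreutils `timeout`.
# TERM is sent after $1 seconds, KILL 5 seconds later if it is still alive.
run_check_with_limit() {
    limit=$1
    shift
    rc=0
    timeout -k 5 "$limit" "$@" || rc=$?
    # 124 = timed out on TERM, 137 = killed by the follow-up KILL (128 + 9).
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "UNKNOWN - check exceeded ${limit}s and was terminated"
        return 3   # Nagios plugin exit code for UNKNOWN
    fi
    return "$rc"
}
```

On the remote host this could be called as `run_check_with_limit 240 /usr/local/nagios/libexec/check_disk -w 20% -c 10%`, keeping the limit below the check_by_ssh timeout so the wrapper fires first.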

Post by BanditBBS »

Box293 wrote: How about using the Linux timeout command to kill the process on the remote box after x time?

Code:

timeout 10 ping www.goooooogle.com
Hmm, I would have loved to try that, but it seems none of my Linux servers have the timeout command, lol.

Also - http://unix.stackexchange.com/questions ... s-not-work
Read that, especially this:
A classical case of long uninterruptible sleep is processes accessing files over NFS when the server is not responding; modern implementations tend not to impose uninterruptible sleep (e.g. under Linux, the intr mount option allows a signal to interrupt NFS file accesses).
That is what's making it hang.
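
Where the timeout command is not installed, roughly the same effect can be approximated in plain sh with a background watchdog. This is a minimal sketch; the function name `run_with_watchdog` is hypothetical. As the quoted passage implies, a process in uninterruptible sleep will not act on the signal until the blocked I/O returns, and in that case the `wait` in this wrapper blocks as well.

```shell
#!/bin/sh
# Hypothetical fallback for hosts without the `timeout` command.
run_with_watchdog() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: after $limit seconds, KILL the command if it is still running.
    # Its output is discarded so it never holds the caller's stdout open.
    ( sleep "$limit"; kill -9 "$cmd_pid" ) >/dev/null 2>&1 &
    dog_pid=$!
    rc=0
    wait "$cmd_pid" || rc=$?
    # The command has finished (or was killed); stop the watchdog.
    kill "$dog_pid" 2>/dev/null
    wait "$dog_pid" 2>/dev/null || true
    return "$rc"
}
```

A command killed by the watchdog returns 137 (128 + 9), which a caller can map to whatever plugin state it prefers.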
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am


Post by cmerchant »

Bandit, is your check_by_ssh check_disk checking NFS mounts as well? If that is what is hanging the ssh checks, you could add the -X nfs option to exclude NFS-mounted file systems.

Example:

Code:

/usr/local/nagios/libexec/check_by_ssh -H hostname -C "/usr/local/nagios/libexec/check_disk -w 20% -c 10% -A -i .gvfs -X nfs"
What was the version of the plugins prior to the upgrade?

Post by BanditBBS »

Yes, we are checking NFS. We were not before the upgrade; I needed to update the plugins to take advantage of other options (they were quite old!). Sorry if I sounded like the upgrade was what made these start to hang; it is definitely the fact that it is checking NFS mounts. We would like to keep checking the NFS mounts for space, but with this issue I'm not sure that is possible.

Post by cmerchant »

Is it the same NFS server for all of the remote servers? Isn't it kind of redundant to check the space on each one, and could you check the NFS server's space directly?

I suppose you want to know that the NFS mounts are currently mounted and active. Have you pushed out the new plugins to all of the servers?

Post by BanditBBS »

cmerchant wrote: Is it the same NFS server for all of the remote servers? Isn't it kind of redundant to check the space on each one, and could you check the NFS server's space directly?

I suppose you want to know that the NFS mounts are currently mounted and active. Have you pushed out the new plugins to all of the servers?
Yeah, I pushed it out to all servers. It's not the plugin's fault; I'm just hoping to find a way to work around the issue, which is why I'm posting here. When the plugin hangs like this (causing high load on the remote server), even going to the CLI of the remote server and typing df results in a hung process. I fear there isn't a way around this and it's just how NFS behaves when there is an issue.
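
One possible partial workaround is to probe each NFS mount with a short time limit before running check_disk, and to report mounts that do not answer instead of letting the disk check block on them. The sketch below assumes `timeout` and GNU `stat` are available on the remote host, which this thread suggests is not the case everywhere; all three function names are hypothetical. The probe itself can still get stuck on an unresponsive hard mount, so this limits the pile-up rather than removing the cause.

```shell
#!/bin/sh
# Hypothetical helper: print mount points whose type is nfs or nfs4.
# Reads /proc/mounts format (device mountpoint fstype options dump pass).
list_nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' "${1:-/proc/mounts}"
}

# Hypothetical helper: succeed only if statfs on the path answers in time.
probe_mount() {
    timeout -s KILL "${2:-5}" stat -f -- "$1" >/dev/null 2>&1
}

# Report unresponsive NFS mounts instead of letting check_disk block on them.
check_nfs_responsive() {
    status=0
    for mnt in $(list_nfs_mounts "$1"); do
        if ! probe_mount "$mnt"; then
            echo "CRITICAL - NFS mount $mnt did not respond"
            status=2
        fi
    done
    [ "$status" -eq 0 ] && echo "OK - all NFS mounts responded"
    return "$status"
}
```

Run ahead of check_disk, a non-zero result could be used to fall back to the -X nfs form of the disk check for that cycle.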