Zombie processes hanging server using check_esx via sudo

Post by **bomahony** » Thu May 30, 2019 6:45 am

Hey folks

We have hit an issue in the last two weeks where running check_esx via sudo is not terminating child processes, and the checks continue to generate more and more zombie processes until the XI checks stop altogether. Restarting the monitoring service from the GUI resolves this. The same setup had been working for the last few months.

XI version 5.5.8 [planning an upgrade to 5.6 in late June]
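
To see how quickly the zombies pile up and which parent is failing to reap them, a small helper along these lines may be useful. It is only a sketch: the function name `count_zombies_by_parent` is made up for illustration, and it assumes a procps-style `ps` that accepts `-eo ppid=,stat=,comm=`.

```shell
#!/bin/sh
# Count zombie processes (state starting with "Z") per parent PID.
# Reads lines of "PPID STAT COMMAND" on stdin and prints "PPID COUNT",
# with the busiest parent first.
count_zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print p, n[p] }' |
        sort -k2,2nr
}

# Live usage (the trailing "=" on each column suppresses the header line):
# ps -eo ppid=,stat=,comm= | count_zombies_by_parent
```

Running the live command every few minutes and comparing the counts should show whether the zombies all hang off the same parent process.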

Post by **cdienger** » Thu May 30, 2019 11:41 am

Can you explain more about how it is configured using sudo? I'd also like to see a profile gathered when it is in this state (Admin > System Config > System Profile > Download Profile). Please PM this to me.

Post by **bomahony** » Tue Jun 04, 2019 6:57 am

Was a bank holiday here yesterday, will get this today for you.

B

Post by **cdienger** » Tue Jun 04, 2019 11:33 am

Sounds good.

Post by **bomahony** » Wed Jun 05, 2019 6:33 am

Apologies, we are currently using cron to restart the services every few hours, and apparently I need a CR to disable this, even temporarily. It may be tomorrow before we get the data.

Post by **lmiltchev** » Wed Jun 05, 2019 1:43 pm

Noted.

Post by **cdienger** » Wed Jun 05, 2019 1:52 pm

I received the profile and it is likely hanging because there isn't an entry in the /etc/sudoers file to allow nagios to run the script without requiring a password. Make sure there is an entry like:

Code:

nagios ALL=NOPASSWD: /usr/local/nagios/libexec/check_vmware_api.pl

Post by **bomahony** » Thu Jun 06, 2019 5:01 am

Sorry mate, I might not have explained this properly. This works fine *most* of the time, and has done since last October. The sudoers config is done in /etc/sudoers.d:

root@mon01 0 11:00:07 /home/ # cat /etc/sudoers.d/nagios
Defaults:nagios !requiretty
Cmnd_Alias NAGIOSCMD = /usr/local/nagios/libexec/check_vmware_api.pl
nagios ALL = NOPASSWD: NAGIOSCMD

Do I need to add something else to make it terminate properly?

Post by **cdienger** » Thu Jun 06, 2019 12:56 pm

Thanks for the clarification. Does the situation improve if you run the check with a timeout set? Try running it with "-t 30" so that it times out after 30 seconds.
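
As an extra guard alongside the plugin's own `-t` option, the whole sudo invocation could be wrapped in coreutils `timeout`, so that a hung check is killed from the outside and reported as UNKNOWN. This is a sketch and not something confirmed to fix the zombies: `run_with_deadline` is a hypothetical helper name, and the 35-second outer limit in the usage comment is an arbitrary value picked to sit just above the suggested `-t 30`.

```shell
#!/bin/sh
# Run a command with a hard outer deadline (requires coreutils `timeout`).
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
run_with_deadline() {
    rwd_secs=$1
    shift
    # Send TERM after $rwd_secs; follow up with KILL 5 seconds later
    # if the command is still running.
    timeout -k 5 "$rwd_secs" "$@"
    rwd_rc=$?
    # timeout exits 124 when the deadline was hit, 137 when KILL was needed.
    if [ "$rwd_rc" -eq 124 ] || [ "$rwd_rc" -eq 137 ]; then
        echo "UNKNOWN - check exceeded ${rwd_secs}s and was killed"
        return 3    # 3 is the Nagios plugin exit code for UNKNOWN
    fi
    return "$rwd_rc"
}

# Example (keep whatever arguments the check already uses):
# run_with_deadline 35 sudo /usr/local/nagios/libexec/check_vmware_api.pl -t 30 <existing arguments>
```

The outer limit is deliberately a little longer than the plugin timeout so that the plugin gets the chance to exit cleanly on its own first.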

Post by **bomahony** » Thu Jun 06, 2019 2:33 pm

Never even thought of that. I have a CR in for tomorrow and might try and sneak this in with it on one site and see how it goes.

Thing is, now that I think about it, the ESX check is actually checking customer stuff, and I reckon they upgraded to vCenter 6.7 around the time the problem started. I cannot confirm this, but it sounds possible. I checked the date on the sudoers file and it was last October, so it had been running fine for 6+ months.

Nagios Support Forum
