More box293_check_vmware fun

highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Post by highness »

We ran into another snag after we picked this ball back up this morning.

We have a vCenter cluster with a large number of hosts, and as a result every check type (vCenter, hosts, datastores, guests) times out.

Where can I change the timeout for these checks?
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: More box293_check_vmware fun

Post by lmiltchev »

Perhaps box293 will correct me if I am wrong... :)

I opened the plugin in a text editor and found that the "default" timeout is 60 seconds. You could probably increase this value as much as needed (around line 287).

Code: Select all

timeout => {
		type => '=i',
		help => 'Specify the time a check is allowed to execute for. 60 seconds by default.',
		required => 0,
		default => 60,
		}

Re: More box293_check_vmware fun

Post by highness »

lmiltchev wrote:Perhaps box293 will correct me if I am wrong... :)

I opened the plugin in a text editor and found that the "default" timeout is 60 seconds. You could probably increase this value as much as needed (around line 287).

Code: Select all

timeout => {
		type => '=i',
		help => 'Specify the time a check is allowed to execute for. 60 seconds by default.',
		required => 0,
		default => 60,
		}
Actually tried that - increased it to 300, but it still times out.
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Re: More box293_check_vmware fun

Post by snapon_admin »

I believe you just have to add --timeout to the command. Something like this:

Code: Select all

check_by_ssh -E 1 -t 90 -l vi-admin -H xx.xx.xx.xx -C "~/box293_check_vmware.pl --concurrent_checks 400 --timeout 90 --check Host_CPU_Info --server xx.xx.xx.xx --host xx.xx.xx.xx"
The first "-t 90" is for check_by_ssh, and "--timeout 90" is for box293_check_vmware.pl.

Re: More box293_check_vmware fun

Post by highness »

snapon_admin wrote:I believe you just have to add --timeout to the command. Something like this:

Code: Select all

check_by_ssh -E 1 -t 90 -l vi-admin -H xx.xx.xx.xx -C "~/box293_check_vmware.pl --concurrent_checks 400 --timeout 90 --check Host_CPU_Info --server xx.xx.xx.xx --host xx.xx.xx.xx"
The first "-t 90" is for check_by_ssh, and "--timeout 90" is for box293_check_vmware.pl.
But which command would that need to be added to? All 41 of them?

UPDATE: I added the timeout (300 seconds) to a couple of the commands, but still no joy.
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: More box293_check_vmware fun

Post by Box293 »

Probably two or three things going on here...

How many checks have you created using the wizard?

I suspect that your vMA appliance may not have enough memory and CPUs. I would up it to 8GB and 2 CPUs. If the vMA does not have enough resources then almost all checks will time out. Use the "top" command while logged on to the vMA as vi-admin to observe CPU and memory usage.

Next, let's remove Nagios from the equation altogether. We'll execute some checks on the vMA box directly and see how long they take.
  • ssh to the vMA as vi-admin

    Here's an example check:

    Code: Select all

    Command:
    ~/box293_check_vmware.pl --server vcenter.box293.local --check Guest_Snapshot --guest centos07 --timeout 600
    
    Output:
    OK: ['centos07' (Notes: Before Starting) (Age: 40)]
    I've added a super long timeout so we can find out how long it runs for

    Now we can "time how long it takes" by simply starting the command with time

    Code: Select all

    Command:
    time ~/box293_check_vmware.pl --server vcenter.box293.local --check Guest_Snapshot --guest centos07 --timeout 600
    
    Output:
    OK: ['centos07' (Notes: Before Starting) (Age: 40)]
    
    real	0m4.194s
    user	0m3.964s
    sys	0m0.068s
    
    So you can see that this took 4.194 seconds.
So with that information you now roughly know what --timeout value to use in the commands.

LASTLY ... if you need a timeout greater than 60 then READ THIS:
Nagios Core has a default service check timeout of 60 seconds, as defined in /usr/local/nagios/etc/nagios.cfg:
service_check_timeout=60
If you need checks to run longer than 60 seconds then you need to change nagios.cfg and restart the Nagios service. In Nagios XI you can do this by:
CCM
Advanced
Nagios Core Main Config
Change service_check_timeout=
Click Save
Click Apply Configuration
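
A rough illustration of why this matters: a check passes through several limits (the Nagios Core service_check_timeout, the -t value given to check_by_ssh, and the plugin's --timeout), and whichever expires first ends the check. The sketch below only compares the numbers mentioned in this thread; the variable names are made up for illustration and are not settings in any of these tools.

```shell
#!/bin/sh
# Hypothetical variable names; the values are the ones quoted in this thread.
nagios_timeout=60    # service_check_timeout in nagios.cfg
ssh_timeout=90       # -t passed to check_by_ssh
plugin_timeout=90    # --timeout passed to box293_check_vmware.pl

# The effective limit is the smallest value in the chain.
effective=$nagios_timeout
for t in "$ssh_timeout" "$plugin_timeout"; do
    if [ "$t" -lt "$effective" ]; then
        effective=$t
    fi
done

echo "Effective timeout: ${effective}s"
```

With these values the result is 60, which is why raising only --timeout past 60 does not help until service_check_timeout is raised as well.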

Finally, answers to some of the other posts in the thread.
lmiltchev wrote:I opened the plugin in a text editor and found that the "default" timeout is 60 seconds. You could probably increase this value as much as needed (around line 287).
Yes, this would work in principle. However, the commands already pass --timeout 90 by default, and that argument overrides the default in the script, so changing the script has no effect.

highness wrote:But which command would that need to be added to? All 41 of them?
Yes and no. There are 41 commands because the wizard was designed for some flexibility.

When looking at a service definition that is timing out, take note of the check command; this is the one that needs to be configured.

Box293 wrote:--timeout is defined in the commands by default at 90
Don't ask me to give a valid reason why I chose 90 in the commands when Nagios uses 60 by default :oops:


Let us know how you go and if any of this resolved your issues. This is some good troubleshooting I need to add to the manual.

Re: More box293_check_vmware fun

Post by highness »

Box293 wrote:I suspect that your vMA appliance may not have enough memory and CPUs. I would up it to 8GB and 2 CPUs. If the vMA does not have enough resources then almost all checks will time out. Use the "top" command while logged on to the vMA as vi-admin to observe CPU and memory usage.
Boosted the vMA box to 2 CPUs and 8GB. I can now get the "Retrieve vCenter Objects" (which returns 23 clusters) to complete, but the rest (hosts, datastores, guests) still time out.
Box293 wrote:We'll execute some checks on the vMA box directly and see how long they take.
Here are a couple of checks I ran:

Code: Select all

vi-admin@vma01:~> time ~/box293_check_vmware.pl --server 10.YYY.YYY.YYY --check Guest_Snapshot --guest guest1.host.com --timeout 600; time ~/box293_check_vmware.pl --server 10.YYY.YYY.YYY --check Guest_Snapshot --guest guest2.host.com --timeout 600

OK: No snapshots found
real	0m8.982s
user	0m6.420s
sys	0m0.080s

OK: No snapshots found
real	0m7.938s
user	0m6.648s
sys	0m0.128s

In doing some digging through vCenter client, it appears that my largest vCenter host (10.YYY.YYY.YYY) has 50+ ESX hosts and 1500+ guests on those ESX hosts. So, if each single check is taking 8-9 seconds per guest, would I need to set the timeout to about 3.5 hours?
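
For reference, the arithmetic behind that estimate can be sketched as below, assuming all 1500 guest checks ran one after another at about 8.5 seconds each (an assumed midpoint of the timings above). This is a serial total, not necessarily the value any single --timeout needs.

```shell
#!/bin/sh
guests=1500          # approximate guest count from the post
secs_per_check=8.5   # assumed midpoint of the 8-9 second timings

# Total seconds if every check ran back to back, then converted to hours.
total_secs=$(LC_ALL=C awk -v g="$guests" -v s="$secs_per_check" 'BEGIN { printf "%d", g * s }')
total_hours=$(LC_ALL=C awk -v t="$total_secs" 'BEGIN { printf "%.1f", t / 3600 }')

echo "${total_secs}s is about ${total_hours}h"
```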

Re: More box293_check_vmware fun

Post by Box293 »

OK I've got a better idea of what's happening now ... I need to do some more investigating and I'll reply back in a couple of hours.

Re: More box293_check_vmware fun

Post by lmiltchev »

Thanks for looking into this, Troy!

Re: More box293_check_vmware fun

Post by Box293 »

OK So I didn't realise your problem was with the wizard ... this is somewhat simpler to resolve.

The wizard runs some "special" checks to perform the "Retrieve xxx Objects". The wizard has a hard coded timeout of 90 seconds. I will give you a command to change the timeout in the wizard, but first we need to know how long these "special" checks take to run.

When you execute these checks to time how long they take, a LOT of garbled text will be displayed ... we are only interested in the final time it takes to run the check.
ssh to the vMA as vi-admin
Type these four commands:

Code: Select all

time ~/box293_check_vmware.pl --server 10.YYY.YYY.YYY --timeout 600 --check List_Datastores
time ~/box293_check_vmware.pl --server 10.YYY.YYY.YYY --timeout 600 --check List_Guests
time ~/box293_check_vmware.pl --server 10.YYY.YYY.YYY --timeout 600 --check List_Hosts
time ~/box293_check_vmware.pl --server 10.YYY.YYY.YYY --timeout 600 --check List_vCenter_Objects
From the results, whichever one is the longest, double that number. I am going to use the number 300 in the next step.
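
The "longest, then double" rule can be sketched as follows. The four timings are invented sample values, not results from this thread; replace them with the "real" times from the commands above, converted to seconds.

```shell
#!/bin/sh
# Hypothetical "real" times in seconds for the four List_* checks.
times="42.7 118.3 12.9 66.0"

# Find the longest time, double it, and round up to a whole second.
suggested=$(printf '%s\n' $times | LC_ALL=C awk '
    $1 > max { max = $1 }
    END {
        doubled = 2 * max
        rounded = int(doubled)
        if (rounded < doubled) rounded++
        print rounded
    }')

echo "Suggested wizard timeout: ${suggested}s"
```

With the sample values the longest time is 118.3 seconds, so the sketch suggests 237.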

Establish an ssh session to your Nagios XI Host
Type this command:

Code: Select all

sed -i 's/-t 90/-t 300/g'  /usr/local/nagiosxi/html/includes/configwizards/vmwarevirtualizationwizard/vmwarevirtualizationwizard_misc.php
Now go and run the configuration wizard; everything should work fine now.
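
If you would rather preview the substitution before editing the wizard file in place, something like the following may help. It is only a sketch: the sample line is a made-up stand-in, since the real contents of vmwarevirtualizationwizard_misc.php are not shown in this thread.

```shell
#!/bin/sh
# Hypothetical stand-in for the wizard file; its real contents are not shown here.
sample=$(mktemp)
printf '%s\n' 'check_by_ssh -E 1 -t 90 -l vi-admin' > "$sample"

# Count the lines that would be touched, then print the result without -i,
# so nothing is modified on disk.
matches=$(grep -c -e '-t 90' "$sample")
preview=$(sed 's/-t 90/-t 300/g' "$sample")

echo "Lines matching: $matches"
echo "After substitution: $preview"
rm -f "$sample"
```

On the Nagios XI host you would point grep and sed at the real wizard file instead of the sample, and keeping a copy of the file (for example with cp) before running sed -i makes it easy to roll back.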

Let us know how you go.

Don't worry about all the talk in the other posts about changing the check commands, I had the wrong idea as to what was happening.


Thanks for your patience so far helping us resolve this problem. I will take away what you have experienced here and include a timeout option in the next version of the wizard (probably out in a couple of months).