Nagios, Perl plugins and Persistence

mrbungie · Post by **mrbungie** » Thu May 02, 2013 3:59 pm

Hi everyone,

I'm using Nagios to monitor some VMware vCenter Servers in a production environment with this Perl plugin: http://www.op5.org/community/plugin-inv ... esx-plugin. The checks (CPU, MEM, NET, RUNTIME, IO) work fine for a couple of hosts, but once I start monitoring too many servers I begin having problems with the checks. The plugin simply generates too many connections, and to get around that I think I need connection persistence between checks.

From what I've gathered, the actual problem is that even though the SDK connection routines (which the plugin uses) were written to look for open connections before opening new ones, they can obviously only find connections created during the current run. And as we all know, Nagios fork()s every time it runs an active check (twice by default), so in the end the plugin creates a new connection for every check on every host.
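A toy sketch of this effect, in Python rather than the plugin's Perl. Nothing here talks to vCenter: `get_session`, `sessions_opened`, and the host names are hypothetical, and a "session" is just a counted placeholder. It only compares one fresh process per check against a single long-lived process.

```python
def get_session(cache, host, counter):
    """Reuse a session from this process's cache, or 'open' a new one."""
    if host not in cache:
        counter["opened"] += 1   # stands in for connect + authenticate
        cache[host] = object()   # placeholder for a real SDK session
    return cache[host]


def sessions_opened(hosts, checks, persistent):
    """Count the sessions opened when every check runs once per host."""
    counter = {"opened": 0}
    shared = {}  # the cache a long-lived process would keep between checks
    for host in hosts:
        for _check in checks:
            # A forked, short-lived check starts with an empty cache.
            cache = shared if persistent else {}
            get_session(cache, host, counter)
    return counter["opened"]


hosts = ["esx01", "esx02", "esx03"]             # hypothetical host names
checks = ["CPU", "MEM", "NET", "RUNTIME", "IO"]
print(sessions_opened(hosts, checks, persistent=False))  # 15
print(sessions_opened(hosts, checks, persistent=True))   # 3
```

The in-process lookup only helps when something outlives the individual check, which is what the options below try to arrange.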

From what I've read I have these options:

Turning the plugin into a daemon and making these checks passive (for now this looks like my first choice).
Modifying the plugin and using something from CPAN to do IPC (likely to work, but dirty as hell).
Managing check scheduling so that this situation never happens (for example, setting max_concurrent_checks to 1, but I fear that would slow everything down).
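For the first option, a daemon would hand its results to Nagios as passive checks. A minimal sketch of that hand-off, assuming the standard `PROCESS_SERVICE_CHECK_RESULT` external command; the function names, the host and service names, and the command file path (the usual default for a source install of Core) are assumptions and will differ per installation.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format one passive service check result as an external command."""
    timestamp = int(time.time() if now is None else now)
    # A newline would terminate the command early, so flatten the output.
    clean = output.replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{clean}\n"
    )


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the command to the Nagios command pipe (path is an assumption)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Hypothetical host and service names, for illustration only.
print(passive_result_line("vcenter01", "CPU", 0, "OK - cpu 12%", now=1367521140), end="")
```

The daemon would keep one SDK session open, loop over the hosts, and call `submit()` for each result; the matching services need passive checks enabled on the Nagios side.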

So, guys, what would you do in this situation? I would greatly appreciate any help.
Thanks!

mrbungie · Post by **mrbungie** » Thu May 02, 2013 6:42 pm

I completely forgot about my system details. I'm running the 2012R1 Nagios XI appliance (I posted here because the problem has to do with Core itself) with both the vSphere SDK for Perl and the op5 check_esxi Perl plugin deployed. I'm monitoring vCenter Server 5.1 running on Windows Server 2008 R2 with SQL Server 2008, managing ESXi 5.1 hypervisors.

Thanks!

Post by **jsmurphy** » Thu May 02, 2013 8:08 pm

I actually worked with VMware and op5 for a couple of months on this particular problem, because the only feasible option for those of us who are heavily virtualised was to run a server dedicated to executing that check, which was ridiculous. I never heard the end result of the discussion between VMware and op5, but I don't think anything particularly useful came of it.

Well... except I ended up creating my own solution that uses the standard VMware alarm interface and NRDP:
http://exchange.nagios.org/directory/Pl ... er/details
or
http://roshamboot.org/main/projects/vmw ... or-nagios/

VMware ended up advocating this approach, at least in my corner of the world anyway.

It does have some drawbacks: it isn't great for collecting perf data, although you still can if you really want to. Hopefully this helps you without re-inventing the wheel.

mrbungie · Post by **mrbungie** » Fri May 03, 2013 12:17 am

Hi jsmurphy, thanks for your prompt reply. I was sure someone had to know about this problem. Can you share more about those discussions? I'd like to add a bit more information to my report on the issue, and what you know might prove useful.

Thanks!

sreinhardt · Post by **sreinhardt** » Fri May 03, 2013 3:05 pm

As you have seen, and as jsmurphy started to explain, the issue is the sheer number of connections needed for the checks. As far as I know, the SDK and VMware servers simply do not allow a persistent connection, except with their native client. Because of this, each check is forced to create a new connection, authenticate, and then run. As was mentioned, setting up a dedicated server for this check would be one option, and a mod_gearman setup might be another good alternative. Either way, you will still see the same issues with the number of connections and with logs on the VMware servers. This simply cannot be avoided with the VMware SDK and the current plugins, unless you go with the alert/passive check combination, which does sound like a rather good idea.

Post by **jsmurphy** » Fri May 03, 2013 5:42 pm

Pretty much what sreinhardt said.

The SDK wasn't designed to be used like this, and there wasn't really anything VMware was willing (or able) to do to alleviate the situation. They did offer a few small efficiency tips, but nothing that would fix the underlying issue of having to offload the check. Prior to the engagement I toyed with the idea of writing a version of the check that basically cached the data for Nagios to query, but given my previous experience coding against the VMware SDK, I concluded that I still wasn't going to get it to a level of efficiency I would be happy with, even using batch queries.
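For readers wondering what the caching idea would look like, here is a rough sketch of the general shape only, not of anything that was actually built (the approach was abandoned, as described above). A separate collector writes metrics to a file and the Nagios check reads that file instead of connecting. The file layout, function names, and the 300-second freshness limit are all assumptions.

```python
import json
import os
import tempfile
import time

UNKNOWN = 3  # Nagios plugin exit code for UNKNOWN (0 is OK)


def write_cache(path, metrics, now=None):
    """Collector side: store metrics with a timestamp, replacing atomically."""
    payload = {
        "collected_at": time.time() if now is None else now,
        "metrics": metrics,
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, path)  # readers never see a half-written file


def check_from_cache(path, metric, max_age=300, now=None):
    """Check side: return (exit_code, text) without touching the SDK."""
    now = time.time() if now is None else now
    try:
        with open(path) as handle:
            payload = json.load(handle)
    except (OSError, ValueError):
        return UNKNOWN, "UNKNOWN - no cached data"
    age = now - payload["collected_at"]
    if age > max_age:
        return UNKNOWN, f"UNKNOWN - cached data is {int(age)}s old"
    if metric not in payload["metrics"]:
        return UNKNOWN, f"UNKNOWN - {metric} not in cache"
    return 0, f"OK - {metric}={payload['metrics'][metric]}"
```

The check becomes cheap, but the collector still has to pull everything through the SDK on a schedule, which is where the efficiency concern above comes from.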

That's when I noticed, while poring over an old version of the VMware documentation, that there were environment variables available on alarm execution that had gone missing from the newer versions of the docs. It turns out the variables still existed, VMware amended the documentation, and the above plugin was the result. op5 and VMware were still communicating at that point, but I don't know anything beyond that, other than providing some perf metrics for them.
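The alarm-driven approach can be sketched roughly as follows. This is not the linked plugin's code: it assumes the alarm script sees variables named `VMWARE_ALARM_NAME`, `VMWARE_ALARM_TARGET_NAME`, and `VMWARE_ALARM_NEWSTATUS`, and that the status values are colour names such as Green, Yellow, and Red. Check both assumptions against the VMware documentation for your version. The function name and the shape of the returned dictionary are hypothetical.

```python
import os

# Assumed mapping from alarm status colours to Nagios states.
STATUS_TO_STATE = {"green": 0, "yellow": 1, "red": 2}
UNKNOWN = 3


def alarm_to_check_result(env=None):
    """Translate alarm environment variables into a passive check result."""
    env = os.environ if env is None else env
    alarm = env.get("VMWARE_ALARM_NAME", "unknown alarm")
    target = env.get("VMWARE_ALARM_TARGET_NAME", "unknown target")
    status = env.get("VMWARE_ALARM_NEWSTATUS", "").strip()
    # Anything that is not a known colour is reported as UNKNOWN.
    state = STATUS_TO_STATE.get(status.lower(), UNKNOWN)
    return {
        "hostname": target,
        "servicename": alarm,
        "state": state,
        "output": f"{alarm} on {target} is {status or 'unknown'}",
    }
```

A script attached to the alarm would call `alarm_to_check_result()` and submit the result to Nagios through NRDP, so vCenter pushes state changes and nothing has to poll the SDK.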

mrbungie · Post by **mrbungie** » Fri May 03, 2013 9:52 pm

I will try what you said, thank you both.

slansing · Post by **slansing** » Mon May 06, 2013 10:15 am

Let us know how these suggestions work out for you, mrbungie!
