SNMP troubles after 5R1.0 upgrade

dlukinski · Post by **dlukinski** » Fri Oct 09, 2015 1:26 pm

I ran it on a test VMware OVA (5R1.0) with all defaults: Disk Table: bool(false) Service Table: bool(false) Process Name Table: bool(false)
The SNMP walk error appeared on the 2nd screen.

-----------------------

lgroschen wrote:Could you try the following from your Nagios XI server terminal:
Code: Select all
snmpwalk -v2c -c <community_string> <target_windows_host_IP>
and post the output from the command. If you set up the permissions on the Windows server correctly, it should return a long walk; otherwise it should tell you why the walk failed.

- It returned an extremely long walk and did not fail (tried on 2 systems where the Windows SNMP Wizard shows that error).
Once again, I want to point out that this is not about the node's security, but about the 5R1.0 Windows SNMP Wizard in conjunction with the buffer and the Apache URI.

dlukinski · Post by **dlukinski** » Fri Oct 09, 2015 1:48 pm

ssax wrote:First backup the original file:
Code: Select all
cp /usr/local/nagiosxi/html/includes/configwizards/windowssnmp/windowssnmp.inc.php /tmp/
Then replace the file below with the one attached (unzip it first):
Code: Select all
/usr/local/nagiosxi/html/includes/configwizards/windowssnmp/windowssnmp.inc.php
Then run through the wizard again, on the second page where you get the error it should show some values at the very top, please post them here.

Then you can revert to the original file:
Code: Select all
cp /tmp/windowssnmp.inc.php /usr/local/nagiosxi/html/includes/configwizards/windowssnmp/windowssnmp.inc.php 
Thank you

-----------------------
Disk Table: bool(false) Service Table: bool(false) Process Name Table: bool(false)

dlukinski · Post by **dlukinski** » Fri Oct 09, 2015 2:05 pm

One more thing: half the time the Wizard produces "The wizard detected that this server does not have snmpwalk permission on the target host. This will prevent the automatic scan of services and processes and prevent services from running successfully" and yet lists the server drives properly.

5R2.0 upgrade did not help :-\

Tried monitoring same-LAN servers via the Windows SNMP Wizard: it works much better (all defaults).
- Could this be a latency issue, and if so, how do I address it?

One specific host worked from a same-LAN Nagios install, but not from the remote Nagios (no, there are no firewalls, just the distance).

jdalrymple · Post by **jdalrymple** » Sat Oct 10, 2015 9:02 am

1) Your support experience will be improved if you adhere to the advice offered twice now by staff. I'll repeat it again in case you've missed it - if you have anything to add to your diagnostic information please edit your last post, do not create a new one. If you create a new one your age counter resets and it looks to us like it's a "new" issue.

2) Am I correct in assuming that you're running the Linux Server wizard against an ESXi host and that's where the problem lies?

3) Regarding latency, how much latency are we talking about? SNMP can handle some, but it is hurt more by lossy connections than by latency, as it's a UDP (stateless) protocol. Sometimes it becomes necessary to wrap SNMP checks in SSH or NRPE, or to split off a gearman worker, so that the transport protocol can be relied upon.

4) Can you be more specific about what broke as a result of the upgrade? What was working before the upgrade, and in what way did it stop working after the upgrade? The way I read it only the wizard broke, not existing hosts/services. I may be mistaken in my interpretation though.
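
To illustrate point 3: net-snmp's command-line tools accept -t (timeout in seconds) and -r (number of retries), so a slow or lossy path can be probed from the terminal with more generous settings than the defaults. A minimal sketch follows; the helper name walk_cmd, the address 192.0.2.10 and the community string "public" are placeholders, not values from this thread, and the helper only prints the command line so it can be reviewed before running it.

```shell
#!/bin/sh
# Hypothetical helper: build an snmpwalk command line with an explicit
# timeout (-t, seconds) and retry count (-r). It prints the command
# instead of executing it.
walk_cmd() {
    host=$1
    community=$2
    timeout_s=${3:-10}   # default chosen for illustration only
    retries=${4:-3}      # default chosen for illustration only
    printf 'snmpwalk -v2c -c %s -t %s -r %s %s\n' \
        "$community" "$timeout_s" "$retries" "$host"
}

# Placeholder host and community string.
walk_cmd 192.0.2.10 public
```

If the walk only succeeds once the timeout or retry count is raised, that points at the network path rather than at permissions on the target host.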

dlukinski · Post by **dlukinski** » Mon Oct 12, 2015 7:32 pm

jdalrymple wrote:1) Your support experience will be improved if you adhere to the advice offered twice now by staff. I'll repeat it again in case you've missed it - if you have anything to add to your diagnostic information please edit your last post, do not create a new one. If you create a new one your age counter resets and it looks to us like it's a "new" issue.

Dimitri wrote:
- Really unsure how to keep editing the same first post if I have to quote new ones.

2) Am I correct in assuming that you're running the Linux Server wizard against an ESXi host and that's where the problem lies?

Dimitri wrote:
- NOT CORRECT AT ALL. I am running the Windows SNMP Wizard against mostly Windows 2008 R2 servers.
Your product (whose purchase we were already approving) developed persistent errors after the v5 upgrade: one MaxBuffer error and one SNMP walk failure in the Wizard.
I still can't seem to get the Nagios support team to address this one.

3) Regarding latency, how much latency are we talking about? SNMP can handle some, but it is hurt more by lossy connections than by latency, as it's a UDP (stateless) protocol. Sometimes it becomes necessary to wrap SNMP checks in SSH or NRPE, or to split off a gearman worker, so that the transport protocol can be relied upon.

Dimitri wrote:
Not entirely sure about the latency, because a CLI-based SNMP walk works while the Apache-based one (Wizard) does not. It would also work until a certain number of hosts was reached. I really want to know more about the "default" SNMP configuration your product has and any changes that happened in version 5.
One of the reasons we chose Nagios XI is that in the old version its SNMP monitoring worked just fine (being less intrusive and more practical for many applications in our case, as we are talking about hundreds of them).

4) Can you be more specific about what broke as a result of the upgrade? What was working before the upgrade, and in what way did it stop working after the upgrade? The way I read it only the wizard broke, not existing hosts/services. I may be mistaken in my interpretation though.

Dimitri wrote:
I have mentioned this in every other post of mine over the last week, so again:
Before the upgrade, SNMP-based monitoring of Windows 2008 R2 servers WORKED (many test installs, various same or different nodes).
After the upgrade I immediately had MaxBuffer errors on service/process checks for all monitored hosts (about 20-30 in the XI development trial). I re-ran the upgrade manually (4 XI test/dev installs in total), deleted all services/hosts and attempted to re-create monitoring for them via the Windows SNMP Wizard. This is when I discovered that the Wizard would either fail right away on the second screen, or work for the first 5-7 hosts and fail afterwards, or fail while still showing drives, as if the SNMP walk had partially worked. The faulty results were the same on 3-4 test/dev installs (situated in different LANs in different geographic locations).
In some cases raising max_msg_size from 5000 to 10000 would let me run the Wizard a few times before it failed, while in other cases it would fail right away anyway. The same fix worked for the MaxBuffer error produced by SNMP-based service checks (and the error returned the moment I switched max_msg_size back to 5000). The 5.2 upgrade did not help, but I then noticed that running the Wizard against same-LAN hosts works without changing any settings, at least for some time.

jdalrymple · Post by **jdalrymple** » Tue Oct 13, 2015 4:57 pm

This all makes it sound like 2 problems.

The wizard failing I can get because of the 10 second timeout specified in the wizard. Is it taking 10 seconds before it fails?

The precreated hosts/services, I don't have an answer for. They should eventually just turn into snmpgets. Do you still have any of the services that worked prior to the upgrade in place? If so please post a service definition. You can grab it in the CCM UI by clicking the disk icon.

dlukinski · Post by **dlukinski** » Thu Oct 15, 2015 8:01 am

jdalrymple wrote:This all makes it sound like 2 problems.

The wizard failing I can get because of the 10 second timeout specified in the wizard. Is it taking 10 seconds before it fails?

The precreated hosts/services, I don't have an answer for. They should eventually just turn into snmpgets. Do you still have any of the services that worked prior to the upgrade in place? If so please post a service definition. You can grab it in the CCM UI by clicking the disk icon.

A-ha.

This is the message for the services that worked previously: ERROR: Process name table : Message size exceeded buffer maxMsgSize.
I believed this error was gone (when configuring services manually), but it just came back.

The 10-second wait does not happen; snmpwalk returns errors almost immediately (2nd screen).
- Is there a way to change this default, and how? (What would be impacted?)

ssax · Post by **ssax** » Thu Oct 15, 2015 5:15 pm

When you say that you increased maxMsgSize before, where were you adjusting it?

You can adjust the timeout in the script but if it doesn't even take 10 seconds then that is not the problem.

Code: Select all

            $disk_table = snmprealwalk($address, $snmpcommunity, $disk_oid, 10000);
            $w_service_table = snmprealwalk($address, $snmpcommunity, $w_service_oid, 10000);
            $process_name_table = snmprealwalk($address, $snmpcommunity, $process_oid, 10000);

Code: Select all

            $disk_table = snmprealwalk($address, $snmpcommunity, $disk_oid, 20000);
            $w_service_table = snmprealwalk($address, $snmpcommunity, $w_service_oid, 20000);
            $process_name_table = snmprealwalk($address, $snmpcommunity, $process_oid, 20000);
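
A note on units, offered as a hedge rather than a diagnosis: the PHP manual documents the fourth argument of snmprealwalk() as a timeout in microseconds. Assuming the wizard passes these literals straight through, the values above are far shorter than 10 seconds, which would at least be consistent with same-LAN hosts answering in time and remote hosts failing almost immediately. The helper name us_to_seconds is hypothetical; the sketch only does the unit conversion.

```shell
#!/bin/sh
# Hypothetical helper: convert a microsecond timeout to seconds.
us_to_seconds() {
    awk -v us="$1" 'BEGIN { printf "%.2f\n", us / 1000000 }'
}

us_to_seconds 10000      # value in the first snippet  -> 0.01
us_to_seconds 20000      # value in the second snippet -> 0.02
us_to_seconds 10000000   # a 10-second timeout         -> 10.00
```

If that reading of the manual applies here, a 10-second timeout would be written as 10000000 rather than 10000.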

dlukinski · Post by **dlukinski** » Thu Oct 15, 2015 8:45 pm

ssax wrote:When you say that you increased maxMsgSize before, where were you adjusting it?

You can adjust the timeout in the script but if it doesn't even take 10 seconds then that is not the problem.

Code: Select all

            $disk_table = snmprealwalk($address, $snmpcommunity, $disk_oid, 10000);
            $w_service_table = snmprealwalk($address, $snmpcommunity, $w_service_oid, 10000);
            $process_name_table = snmprealwalk($address, $snmpcommunity, $process_oid, 10000);

Code: Select all

            $disk_table = snmprealwalk($address, $snmpcommunity, $disk_oid, 20000);
            $w_service_table = snmprealwalk($address, $snmpcommunity, $w_service_oid, 20000);
            $process_name_table = snmprealwalk($address, $snmpcommunity, $process_oid, 20000);

In /usr/local/nagios/libexec/check_snmp_win.pl ($session->max_msg_size(5000);). It is currently set to 10000 (further increases did not help).
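
For anyone following along, a minimal sketch of making that change from the shell while keeping a backup. It assumes the plugin contains a call of the form max_msg_size(<number>) as quoted above; the function name set_max_msg_size is made up for illustration, and the demonstration runs against a scratch file rather than the real plugin.

```shell
#!/bin/sh
# Hypothetical helper: rewrite every max_msg_size(<number>) call in a file,
# leaving the original next to it with a .bak suffix.
set_max_msg_size() {
    file=$1
    value=$2
    sed -i.bak -E "s/max_msg_size\([0-9]+\)/max_msg_size(${value})/" "$file"
}

# Demonstration on a scratch file (not the real plugin).
scratch=$(mktemp)
printf '%s\n' '$session->max_msg_size(5000);' > "$scratch"
set_max_msg_size "$scratch" 10000
cat "$scratch"
```

Against the real file the call would be set_max_msg_size /usr/local/nagios/libexec/check_snmp_win.pl 10000, run with sufficient privileges. As noted elsewhere in this thread, this affects only the plugin-based checks, not the wizard.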

ssax · Post by **ssax** » Fri Oct 16, 2015 2:50 pm

The plugin is not used by the wizard, so adjusting max_msg_size in the plugin will only affect checks that are already in place.

I don't see a way to increase max_msg_size with the snmprealwalk command (which is what is giving the error). We will have to talk with the developers about this to get more information.

Nagios Support Forum
