check_rrdtraf

SDK
Posts: 45
Joined: Wed Mar 21, 2012 4:23 pm

check_rrdtraf

Post by SDK »

Hello Nagios Support,

We have a strange problem in our environment.

We are using the standard wizard for adding new network switches. This wizard creates an MRTG config file and a corresponding
service with check_rrdtraf, which reads the MRTG RRD file.

The problem is that the bandwidth graph shows a glitch from time to time that doesn't make any sense:

Here is one example:
graph.png
I traced the problem down to the following:

This is the content of the MRTG rrd file:

1371103500: 6.2738031371e+03 8.0207115058e+03
1371103800: 4.2759980537e+03 7.2949858804e+03
1371104100: 4.2759980537e+03 7.2949858804e+03
1371104400: 4.7058647467e+03 7.7942457101e+03
1371104700: 5.9542864011e+03 1.0424203964e+04

And this is the corresponding content of the check_rrdtraf RRD file:

1371103860: 4.9902362900e+04 6.3743967412e+04
1371103920: 4.9902362900e+04 6.3743967412e+04
1371103980: 1.0038085019e+03 1.2833138409e+03
1371104040: 1.0038085019e+03 1.2833138409e+03
1371104100: 1.0038085019e+03 1.2833138409e+03
1371104160: 1.0038085019e+03 1.2833138409e+03
1371104220: 1.0038085019e+03 1.2833138409e+03
1371104280: 3.3523824741e+04 5.7192689302e+04
1371104340: 3.3523824741e+04 5.7192689302e+04
1371104400: 3.3523824741e+04 5.7192689302e+04

The glitch happened because of the values from 1371103980 to 1371104220.
As you can see, the values from 1371103800 to 1371104100 in the MRTG RRD file didn't drop, but the ones in the check_rrdtraf RRD file did!
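One way to put a number on the drop is to divide the check_rrdtraf values by the MRTG values they seem to belong to. The sketch below does that with awk for the first column of the samples posted above. The pairing of rows is an assumption made for illustration, and the thread does not establish why the factors come out this way.

```shell
#!/bin/sh
# ratio NUMERATOR DENOMINATOR: print the quotient rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Normal sample: check_rrdtraf RRD at 1371104400 vs. the MRTG row 4.2759980537e+03.
ratio 3.3523824741e+04 4.2759980537e+03   # prints 7.84

# Glitch sample: check_rrdtraf RRD at 1371103980 vs. the MRTG row at 1371103500.
ratio 1.0038085019e+03 6.2738031371e+03   # prints 0.16
```

So in the posted numbers the glitch samples are not random: they sit at roughly 0.16 times an older MRTG row, while the normal samples sit at roughly 7.84 times the current one.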

I investigated further and found maybe a hint:

In the check_rrdtraf script, a command is executed to read the MRTG RRD file.
It's this line here:

Code:

DATASET=`rrdtool fetch $FILE AVERAGE -s-10minutes| grep -vi "nan"`

Depending on when I execute this command myself, I sometimes get one output line back and sometimes two.

Here is an example:

Code:

rrdtool fetch /var/lib/mrtg/example.rrd AVERAGE -s-10minutes | grep -vi "nan"
                            ds0                 ds1

1371112500: 4.4748936875e+03 9.0650277233e+03
1371112800: 4.0733062489e+03 8.3976221581e+03
==========================================================================

rrdtool fetch /var/lib/mrtg/example.rrd AVERAGE -s-10minutes | grep -vi "nan"
                            ds0                 ds1

1371112800: 4.0733062489e+03 8.3976221581e+03
I think there may be a timing issue here. MRTG is executed by the default cron job every 5 minutes. I must admit that we have quite a large environment with thousands of ports.
Have you ever seen this kind of problem?
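If the varying number of rows is part of the problem, one option is to make the selection explicit instead of relying on however many lines happen to come back. The following is a minimal sketch, not the actual check_rrdtraf code: `latest_row` is a hypothetical helper that reads `rrdtool fetch` output on stdin and always returns only the newest row without NaN values. Whether this removes the glitch is not confirmed in this thread.

```shell
#!/bin/sh
# latest_row: read `rrdtool fetch` output on stdin and print only the
# newest data row that contains no NaN value (hypothetical helper).
latest_row() {
    awk 'tolower($0) !~ /nan/ && $1 ~ /^[0-9]+:$/ { row = $0 }
         END { if (row != "") print row }'
}

# Intended use (not executed here, it needs rrdtool and a real RRD file):
#   rrdtool fetch "$FILE" AVERAGE -s-10minutes | latest_row
```

With this, both outputs shown above would reduce to the same single line, the 1371112800 row.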

Kind regards

Dominik
SDK
Posts: 45
Joined: Wed Mar 21, 2012 4:23 pm

Re: check_rrdtraf

Post by SDK »

Something to add here:

In the displayed timeframe I looked into the MRTG log.

The reason why these two values are identical in the MRTG RRD:

1371103800: 4.2759980537e+03 7.2949858804e+03
1371104100: 4.2759980537e+03 7.2949858804e+03

is the following:

ERROR: I guess another mrtg is running. A lockfile (/var/lock/mrtg/mrtg_l) aged
300 seconds is hanging around. If you are sure that no other mrtg
is running you can remove the lockfile

In this timeframe a previous MRTG cron instance wasn't able to finish due to server load. Is this a problem for the check_rrdtraf script?
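To see how often MRTG runs are being skipped, the lockfile error can be counted in the MRTG log. This is only a sketch: `count_lock_errors` is a made-up helper name, and the log file path depends on how MRTG logging is set up on the system, so it is passed in as an argument.

```shell
#!/bin/sh
# count_lock_errors LOGFILE: print how many log lines report that a
# previous mrtg run was still holding the lockfile (hypothetical helper).
count_lock_errors() {
    # grep -c prints the number of matching lines, 0 if there are none.
    grep -c 'another mrtg is running' "$1"
}
```

Each counted line corresponds to one 5-minute interval in which MRTG did not run, so a non-zero count lines up with repeated rows in the MRTG RRD.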

Kind Regards

Dominik
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: check_rrdtraf

Post by slansing »

Have you noticed any service-related issues at this time? Have you checked the Nagios log at that timestamp? What version of SNMP does this device use? If you get the chance, please post the device's section from your MRTG configuration file. Thanks.
SDK
Posts: 45
Joined: Wed Mar 21, 2012 4:23 pm

Re: check_rrdtraf

Post by SDK »

Hi Slansing, no, the services are just fine. It's just that the MRTG cron job didn't finish within the 5-minute interval. All our devices are polled via SNMP version 1. I don't want to post the MRTG configuration file because it contains corporate data (IP addresses). But let me just say we have one MRTG config file with over 7000 targets.

Kind Regards

Dominik
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: check_rrdtraf

Post by slansing »

My interest, and the reason for asking for your configuration 'section', is that depending on which version of SNMP is shown in the config file, you can get unexpected performance graph issues. You don't need to send the IP; you can block it out. It would help us check that off the list.

As far as MRTG timing goes, you could also increase the amount of time each host has to return its data. Here is a target with the default timeout:

Code:

Target[192.168.5.2_3]: 3:[email protected]:::::2
In order to change from the default timeout to a new one, you need to edit the colons at the end like so:

Code:

Target[192.168.5.2_3]: 3:[email protected]::15:::2

The colon-separated fields after the host are port, timeout, retries, backoff and SNMP version, so the timeout is the second field. The line above sets the timeout to 15 seconds as opposed to the default of 2.
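Counting colons by eye is error-prone, so here is a small sketch that spells out what an option suffix means. It assumes the MRTG field order port, timeout, retries, backoff, version after the host. `describe_target_opts` is a hypothetical helper, and the suffixes used with it are just examples.

```shell
#!/bin/sh
# describe_target_opts SUFFIX: name the colon-separated SNMP options that
# follow "community@host" in an MRTG Target line (hypothetical helper).
# The body runs in a subshell so the IFS and globbing changes do not leak.
describe_target_opts() (
    opts=${1#:}      # drop the colon that separates the host from the options
    IFS=:
    set -f           # no pathname expansion while splitting
    set -- $opts     # split into port, timeout, retries, backoff, version
    printf 'port=%s timeout=%s retries=%s backoff=%s version=%s\n' \
        "${1:-default}" "${2:-default}" "${3:-default}" \
        "${4:-default}" "${5:-default}"
)
```

For example, `describe_target_opts '::15:::2'` reports a timeout of 15 and SNMP version 2, with everything else left at the default.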

Before making these changes, consult the following document; after making them, follow the directions at its end, since you are editing the mrtg.cfg file by hand:

http://assets.nagios.com/downloads/nagi ... Router.pdf

Then allow a bit of time for new additions to the .rrd files to be drawn into Nagios XI and check the results.
SDK
Posts: 45
Joined: Wed Mar 21, 2012 4:23 pm

Re: check_rrdtraf

Post by SDK »

Hi again,

I posted that all our devices are checked with SNMP version 1, hence all targets in the MRTG file end in :::::1 (this is what the wizard does by default). The timeout for the SNMP polling is not the problem. There were no timeouts; the MRTG cron job just didn't finish in time due to server load and the number of targets (over 7000), so the in and out numbers are exactly the same for two consecutive 5-minute timeframes. And this brings me back to my question: is this a problem for the check_rrdtraf script? It seems so, judging by the content of the RRD file and the resulting glitch.
SDK
Posts: 45
Joined: Wed Mar 21, 2012 4:23 pm

Re: check_rrdtraf

Post by SDK »

I have another example:

MRTG RRD file (the exact same values twice):

1371132900: 2.0153774466e+02 9.0764725332e+02
1371133200: 2.0015082931e+02 8.9392800006e+02
1371133500: 2.0015082931e+02 8.9392800006e+02
1371133800: 2.1601324438e+02 8.8775528791e+02
1371134100: 3.1672640410e+02 1.5755990628e+03

rrdtraf RRD File (smaller numbers = glitch):

1371133320: 1.6135431313e+03 7.2556542928e+03
1371133380: 3.2246039140e+01 1.4522356054e+02
1371133440: 3.2246039140e+01 1.4522356054e+02
1371133500: 3.2246039140e+01 1.4522356054e+02
1371133560: 3.2246039140e+01 1.4522356054e+02
1371133620: 3.2246039140e+01 1.4522356054e+02

1371133680: 1.5691825013e+03 7.0083955200e+03

The same behavior:

A glitch in the graph due to a wrong bandwidth calculation by the check_rrdtraf script:
graph2.png
The reason for the exact same values over 10 minutes is again:

2013-06-13 16:20:01 -- 2013-06-13 16:20:01: ERROR: I guess another mrtg is running. A lockfile (/var/lock/mrtg/mrtg_l) aged
299 seconds is hanging around. If you are sure that no other mrtg
is running you can remove the lockfile
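Rows like these can be found automatically. The sketch below flags every row in `rrdtool fetch` output whose two values are identical to the row before it. `repeated_rows` is a hypothetical helper, and an interface with truly constant traffic (for example all zeros) would be flagged as well, so hits are only a hint that an MRTG run was skipped.

```shell
#!/bin/sh
# repeated_rows: read `rrdtool fetch` output on stdin and print each data
# row whose values exactly repeat the previous row (hypothetical helper).
repeated_rows() {
    awk '$1 ~ /^[0-9]+:$/ {
        vals = $2 " " $3
        if (vals == prev && tolower(vals) !~ /nan/) print $1, vals
        prev = vals
    }'
}

# Intended use (not executed here, it needs rrdtool and a real RRD file):
#   rrdtool fetch /var/lib/mrtg/example.rrd AVERAGE -s-1hour | repeated_rows
```

Fed with the five MRTG rows from this post, it prints only the 1371133500 row.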

I think I have to split up the mrtg.cfg into multiple cfg files and modify the cron and lock entries accordingly to support multiple MRTG runs.
It's just a little annoying since we wanted to do everything through the web frontend. In this case I think we have to compromise!
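A rough sketch of what the cron side of such a split could look like: one cron line per partial config, each with its own lock file and confcache file so the runs do not block each other. The file names (mrtg-part1.cfg and so on), the /usr/bin/mrtg path and the cron.d line layout are assumptions; `--lock-file` and `--confcache-file` are the mrtg command-line options I would expect to use here, so please check them against the local mrtg man page.

```shell
#!/bin/sh
# mrtg_cron_line CFG: print a cron.d style line that runs mrtg on CFG every
# 5 minutes with a lock file and confcache file named after that config.
# All paths are assumptions for illustration.
mrtg_cron_line() {
    cfg=$1
    name=$(basename "$cfg" .cfg)
    printf '*/5 * * * * root LANG=C LC_ALL=C /usr/bin/mrtg %s --lock-file /var/lock/mrtg/%s_l --confcache-file /var/lib/mrtg/%s.ok\n' \
        "$cfg" "$name" "$name"
}

# Hypothetical split into two partial configs:
mrtg_cron_line /etc/mrtg/mrtg-part1.cfg
mrtg_cron_line /etc/mrtg/mrtg-part2.cfg
```

Because every partial config gets its own lock, a slow run of one part only delays that part instead of the whole target list.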

Kind regards

Dominik
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: check_rrdtraf

Post by sreinhardt »

If I am reading this correctly, you are monitoring a device that goes over 100Mb/s with SNMPv1. There is a known, although not well documented, issue where MRTG cannot handle this and simply reports nothing or an incorrect number. I would strongly suggest changing one of your devices to SNMPv2: rerun the switch and router wizard with v2 and see how it works from there.
SDK
Posts: 45
Joined: Wed Mar 21, 2012 4:23 pm

Re: check_rrdtraf

Post by SDK »

Hello Sreinhardt, I am aware of that problem, which is caused by the 32-bit counters overflowing multiple times on high-speed ports under SNMP v1; SNMP v2 resolves it by using 64-bit counters.
That isn't the case in my examples, though. We are talking about Kb/s here. As I posted in my previous comment, MRTG has the same values twice in its RRDs because the previous MRTG process didn't finish within 5 minutes due to load issues.
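To make the counter argument concrete, here is a back-of-the-envelope sketch of how long a 32-bit octet counter takes to wrap at a constant rate. `wrap_seconds` is a hypothetical helper and the two rates are only examples; the calculation assumes the counter counts bytes and wraps at 2^32.

```shell
#!/bin/sh
# wrap_seconds BITS_PER_SECOND: seconds until a 32-bit octet counter wraps
# at the given constant rate (2^32 bytes divided by bytes per second).
wrap_seconds() {
    awk -v bps="$1" 'BEGIN { printf "%.0f\n", 4294967296 / (bps / 8) }'
}

wrap_seconds 100000000   # 100 Mb/s: prints 344
wrap_seconds 80000       # 80 kb/s:  prints 429497
```

At 100 Mb/s the counter wraps after about 344 seconds, which is close to the 5-minute polling interval, and faster links can wrap more than once per interval. At kilobit rates a wrap takes days, so it cannot explain glitches that appear every few hours or less.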

The check_rrdtraf script then seems to mess things up.

Here is the scale of my last example:
graph3.png
Kind regards
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: check_rrdtraf

Post by abrist »

You may want to check your perfdata and npcd logs for load/timeout warnings:

Code:

tail -25 /usr/local/nagios/var/perfdata.log
tail -25 /usr/local/nagios/var/npcd.log