Nagios XI - Debugging Bandwidth Performance Graphs

Overview

This article covers troubleshooting bandwidth performance graphs when they are not appearing as expected. This can include:

No graphs at all
Gaps of no data in the performance graphs

Common causes are:

Cron daemon is not running
Corrupt configuration files
Deprecated MRTG config files causing MRTG to run longer than five minutes
File / Folder permissions
MRTG config files logging errors
MRTG running longer than five minutes
Command not executing correctly
Directory Missing
SNMP Configuration Incorrect

Editing Files

In many steps of this article you will be required to edit files. This documentation uses the vi text editor. When using the vi editor:

To make changes press i on the keyboard first to enter insert mode
Press Esc to exit insert mode
When you have finished, save the changes in vi by typing :wq and press Enter

Cron Daemon

Make sure the cron daemon is running by executing the following command:

RHEL 7 | CentOS 7 | Oracle Linux 7

systemctl status crond.service

Debian | Ubuntu 16/18

systemctl status cron.service

If the cron daemon is not running, start it by running the following command:

RHEL 7 | CentOS 7 | Oracle Linux 7

systemctl start crond.service

Debian | Ubuntu 16/18

systemctl start cron.service

Corrupt Files / Deprecated Files

Corrupt files can be caused by an unexpected shutdown of the server, or by the server's drive filling up so that the current bandwidth data could not be saved. The configuration files are in /etc/mrtg/conf.d/

Here is a sample of the permissions on a configuration file:

-rw-r--r-- 1 apache apache 33332 Dec 10 16:00 192.168.5.43.cfg

To troubleshoot the corrupt files, run the following command and if any errors are displayed, resolve the errors and re-run the command.

LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg

Deprecated files are config files for devices that you no longer want to monitor. You may have deleted the services in Core Configuration Manager (CCM), but MRTG will continue to try to poll that device. This can have a chain-reaction effect, causing MRTG to run longer than five minutes and preventing collection of data from the devices you do want to monitor.

Running the previous command should highlight any devices that are timing out. Delete their config files using the following command. This example is for the device 192.168.5.43:

rm -f /etc/mrtg/conf.d/192.168.5.43.cfg
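
If you would rather keep a copy of a retired config file than delete it outright, a minimal sketch along these lines moves it out of the polling directory instead. The function name retire_cfg and the backup directory are illustrative assumptions, not part of Nagios XI or MRTG.

```shell
# Sketch: move a device's MRTG config out of the polling directory
# instead of deleting it. retire_cfg and the backup path are hypothetical.
retire_cfg() {
    conf_dir=$1     # e.g. /etc/mrtg/conf.d
    device=$2       # e.g. 192.168.5.43
    backup_dir=$3   # any directory outside conf_dir

    if [ ! -f "$conf_dir/$device.cfg" ]; then
        echo "no config file for $device in $conf_dir" >&2
        return 1
    fi
    mkdir -p "$backup_dir" &&
        mv "$conf_dir/$device.cfg" "$backup_dir/$device.cfg"
}

# Example (the backup path is an assumption):
# retire_cfg /etc/mrtg/conf.d 192.168.5.43 /root/mrtg-retired
```

Moving the file back into /etc/mrtg/conf.d/ restores polling for that device.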

File / Folder Permissions

The performance data files are stored in two locations on the XI server.

The first folder is /var/lib/mrtg

This is where the Bandwidth graphs are stored.

Here is a sample of what the permissions look like:

-rw-rw-r--   1 apache nagios 105312 Jan 27 13:25 192.168.5.43_71.rrd
-rw-rw-r--   1 apache nagios 105312 Jan 27 13:25 192.168.5.43_72.rrd
-rw-rw-r--   1 apache nagios 105312 Jan 27 13:25 192.168.5.43_73.rrd
-rw-rw-r--   1 apache nagios 105312 Jan 27 13:25 192.168.5.43_74.rrd
-rw-rw-r--   1 apache nagios      0 Jan 27 13:25 mrtg.ok

To reset the permissions on the folders and files in /var/lib/mrtg execute the following commands:

RHEL | CentOS | Oracle Linux

cd /var/lib/mrtg
chown apache:nagios *
chmod 0664 *

Debian | Ubuntu

cd /var/lib/mrtg
chown www-data:nagios *
chmod 0664 *
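
Before resetting anything, it can be useful to see which files actually deviate from the expected mode. The sketch below lists regular files in a directory whose mode is not 0664. The helper name list_wrong_mode is hypothetical. Ownership is not checked here because the expected user differs between distributions; find's `! -user` and `! -group` tests could be added for that.

```shell
# Sketch: print regular files directly inside a directory whose
# mode is not exactly 0664. list_wrong_mode is a hypothetical helper.
list_wrong_mode() {
    find "$1" -maxdepth 1 -type f ! -perm 0664
}

# Example:
# list_wrong_mode /var/lib/mrtg
```

No output means every file in the directory already has the expected mode.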

The second folder is at this location:

/usr/local/nagios/share/perfdata

This is where the performance data for all of the hosts and services are stored.

Here is a sample of what the permissions look like:

drwxrwxr-x  2 nagios nagios 4096 Jan 15 11:29 192.168.1.1
drwxrwxr-x  2 nagios nagios 4096 Jan 27 13:30 192.168.5.43

The permissions of the files inside the folder 192.168.5.43 look like this:

-rw-rw-r-- 1 nagios nagios 1534768 Nov 29 13:42 _HOST_.rrd
-rw-rw-r-- 1 nagios nagios    3892 Nov 29 13:42 _HOST_.xml

To reset the permissions on the folders and files in /usr/local/nagios/share/perfdata execute the following commands:

cd /usr/local/nagios/share/perfdata
chown -R nagios:nagios .
find . -type d -exec chmod 0775 {} +
find . -type f -exec chmod 0664 {} +

MRTG Config Files Logging Errors

When you run the Switch / Router wizard in XI, an MRTG config file is created in /etc/mrtg/conf.d that includes all valid ports detected. Even when you select only specific ports to monitor in the wizard, the config file will contain all the valid ports, and MRTG will collect data for those ports every five minutes.

If MRTG has trouble collecting data from a device, it will log this in the root mailbox. If you are not regularly checking this mailbox, the size of the mailbox will grow and over time can slow down MRTG.
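
As a quick check on how large the mailbox has grown, a sketch like the following prints its size in bytes. The helper name mailbox_report is hypothetical, and the default path is the /var/spool/mail/root location used elsewhere in this article.

```shell
# Sketch: report the size of a mailbox file in bytes.
# mailbox_report is a hypothetical helper name.
mailbox_report() {
    mbox=${1:-/var/spool/mail/root}
    if [ ! -f "$mbox" ]; then
        echo "no mailbox at $mbox"
        return 0
    fi
    bytes=$(wc -c < "$mbox")
    # Arithmetic expansion strips any padding wc adds around the number.
    echo "$mbox: $((bytes)) bytes"
}

# Example:
# mailbox_report
```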

To identify any ports that MRTG has problems with, execute this command in a terminal session:

LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg

The output will direct you to the ports that have errors. You can comment out these ports in the relevant config files by prefixing each line with a hash #. Note: each port in the config file is 37 lines long, so you need to comment out all 37 lines.
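
Commenting out 37 lines by hand is error-prone, so a small sed wrapper can do it in one step. This is a sketch only: comment_port_block is a hypothetical helper, and you still need to find the first line of the port's block yourself (for example with vi or grep -n). A .bak copy of the original file is written next to it; move that copy elsewhere if your mrtg.cfg includes every file in the directory rather than only *.cfg files.

```shell
# Sketch: comment out one 37-line port block, given the line number
# where the block starts. comment_port_block is a hypothetical helper.
comment_port_block() {
    cfg=$1                  # config file to edit
    first=$2                # first line of the port's block
    last=$((first + 36))    # 37 lines in total
    sed -i.bak -e "${first},${last}s/^/#/" "$cfg"
}

# Example (the line number 120 is purely illustrative):
# comment_port_block /etc/mrtg/conf.d/192.168.5.43.cfg 120
```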

In addition to this, if you are only collecting a few ports from a 24 port switch, you can comment out all the ports you don't need in the config files. There is no point collecting data every five minutes if it is not being used.

MRTG Running Longer Than Five Minutes

When MRTG runs every five minutes, it is assumed that it will complete within five minutes. If the previous run is still in progress when the next run starts at the five-minute interval, the new run terminates because an MRTG job is already running. This means data is not collected from devices at this interval.

You can identify how long it takes for MRTG to run by executing the following command:

time LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg

The output will end with how long the command took to execute.
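
If you want a direct answer to "did this run overrun the polling interval?", a sketch like the one below wraps any command, discards its output, and compares the elapsed time against 300 seconds. The helper name timed_run is hypothetical, and the resolution is whole seconds.

```shell
# Sketch: run a command, discard its output, and report whether it
# took 300 seconds (five minutes) or longer. timed_run is hypothetical.
timed_run() {
    limit=300
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -ge "$limit" ]; then
        echo "SLOW: ${elapsed}s (limit ${limit}s)"
    else
        echo "OK: ${elapsed}s (limit ${limit}s)"
    fi
}

# Example:
# timed_run env LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg
```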

To resolve this issue, you should first follow the previous topics on deprecated config files, and logging errors.

Next, you can increase the number of forks MRTG is allowed to spawn when it executes. In Nagios XI, MRTG is configured by default to fork four instances. This means that when MRTG executes, it counts up all the config files and divides them into four groups, cutting the time it takes to poll all the devices by roughly a factor of four.

The fork count is defined in /etc/mrtg/mrtg.cfg by the following directive:

Forks: 4

Increase the number as required, for example to 8. The next time the MRTG cron job runs, it will use the new setting.
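
The same change can be made without opening an editor. The following is a sketch under the assumption that the directive appears at the start of a line exactly as shown above; set_forks is a hypothetical helper name, and a .bak copy of the file is kept.

```shell
# Sketch: rewrite the Forks directive in an MRTG config file.
# set_forks is a hypothetical helper; the original is saved as <file>.bak.
set_forks() {
    cfg=$1      # e.g. /etc/mrtg/mrtg.cfg
    count=$2    # new number of forks, e.g. 8
    sed -i.bak -e "s/^Forks:.*/Forks: ${count}/" "$cfg"
}

# Example:
# set_forks /etc/mrtg/mrtg.cfg 8
# grep '^Forks:' /etc/mrtg/mrtg.cfg
```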

Command Not Executing Correctly

Try running the command that Nagios XI runs to check the status of a device. For instance, when monitoring a router or switch, Nagios XI uses the check_rrdtraf plugin.

Test running this plugin manually by running a check, similar to the following:

/usr/local/nagios/libexec/check_rrdtraf -f '/var/lib/mrtg/192.168.6.1_1.rrd' -w 1 -c 2

This should return something that looks like:

OK - Current BW in: 1.57Kbps Out: 365.41bps|in=1.573002Kb/s;1;2 out=365.413424b/s;1;2

If it returns errors, resolve the issues they describe; Nagios XI can then start graphing performance data.
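
When deciding which RRD file to test, it may help to see which ones MRTG has stopped updating. The sketch below lists .rrd files that have not been modified within a given number of minutes. The helper name stale_rrds and the 10-minute default (two polling intervals) are assumptions chosen for illustration.

```shell
# Sketch: list .rrd files not modified in the last N minutes.
# stale_rrds and the 10-minute default are illustrative assumptions.
stale_rrds() {
    dir=${1:-/var/lib/mrtg}
    minutes=${2:-10}
    find "$dir" -maxdepth 1 -name '*.rrd' -mmin "+$minutes"
}

# Example:
# stale_rrds /var/lib/mrtg 10
```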

Directory Missing

Make sure the /var/lock/mrtg/ directory exists. This directory has been observed to disappear occasionally.

You can check the /var/spool/mail/root mailbox using this command:

grep templock /var/spool/mail/root

If you see the following error then you are experiencing the problem:

2016-10-03 19:45:02: ERROR: Creating templock /var/lock/mrtg/mrtg_l_5612: No such file or directory at /usr/bin/mrtg line 1961

Recreate the folder using the following command:

mkdir /var/lock/mrtg
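
Because the directory can disappear again, an idempotent check is safer to re-run than a bare mkdir, which fails if the directory already exists. A minimal sketch, with ensure_lock_dir as a hypothetical helper name:

```shell
# Sketch: create the MRTG lock directory only if it is missing.
# ensure_lock_dir is a hypothetical helper name.
ensure_lock_dir() {
    dir=${1:-/var/lock/mrtg}
    [ -d "$dir" ] || mkdir -p "$dir"
}

# Example:
# ensure_lock_dir
```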

SNMP Configuration Incorrect

Older versions of the Switch Wizard called mrtg with arguments for SNMPv2c, which MRTG does not use. Here is an example of an incorrect setting from a config file in /etc/mrtg/conf.d/:

Target[www.hostaddress.com]: 1:[email protected]:::::2c

Notice the 2c after the run of colons. This field is the SNMP version MRTG will use to poll the device. If it is set to 2c, change it to 2 and save the file. This needs to be done for every entry that was created with 2c. Entries should look like:

Target[www.hostaddress.com]: 1:[email protected]:::::2
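
With many affected entries, editing each one by hand becomes tedious. The sketch below first lists config files that contain a Target line ending in 2c, then rewrites that suffix to 2. Both helper names are hypothetical, and the patterns assume the five-colon form shown above with the version at the end of the line. A .bak copy of each edited file is kept.

```shell
# Sketch: find and fix Target lines that end in ":::::2c".
# list_snmp_2c and fix_snmp_2c are hypothetical helper names.

# Print the names of the given files that contain an affected Target line.
list_snmp_2c() {
    grep -l '^Target\[.*:::::2c[[:space:]]*$' "$@"
}

# Rewrite the trailing 2c to 2 on Target lines; saves <file>.bak first.
fix_snmp_2c() {
    sed -i.bak -e '/^Target\[/s/:::::2c[[:space:]]*$/:::::2/' "$1"
}

# Example:
# list_snmp_2c /etc/mrtg/conf.d/*.cfg
# fix_snmp_2c /etc/mrtg/conf.d/192.168.5.43.cfg
```

Re-running list_snmp_2c afterwards should print nothing for the files you have fixed.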

Final Thoughts

For any support related questions please visit the Nagios Support Forums at:

http://support.nagios.com/forum/