MRTG consumes 100% of system resources

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
TBT
Posts: 625
Joined: Wed May 18, 2011 1:26 pm

MRTG consumes 100% of system resources

Post by TBT »

It appears that when cron runs MRTG, the process hangs and re-spawns, eventually consuming 100% of system resources. This started after an upgrade to XI 5.5.7 and, oddly enough, affects only 1 of our 9 XI servers. The timestamps on the rrd files in /var/lib/mrtg are not updating. We also checked file permissions and ownership on /etc/mrtg and /var/lib/mrtg (mentioned in another thread). No errors are present in /var/log/messages.

CentOS 6.10
rrdtool-1.3.8-7.el6.x86_64
glib2-2.28.8-10.el6.x86_64

Any insight?
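
For anyone wanting to confirm the same symptoms on their own server, here is a minimal sketch (not from the original post). It assumes GNU find and pgrep are available. The function names stale_rrd_files and count_procs are hypothetical, and the 10-minute threshold in the usage comment is only a guess based on MRTG running every 5 minutes.

```shell
# List .rrd files under a directory that have not been modified
# within the last N minutes (stale files suggest MRTG is not writing).
stale_rrd_files() {
    find "$1" -type f -name '*.rrd' -mmin +"$2"
}

# Count running processes whose command line contains the given string.
# A steadily growing count would match the "hangs and re-spawns" symptom.
count_procs() {
    pgrep -f "$1" | wc -l
}

# Example usage (paths taken from the thread, threshold is an assumption):
#   stale_rrd_files /var/lib/mrtg 10
#   count_procs /usr/bin/mrtg
```

If the first command lists most of the rrd files and the second number keeps climbing between cron runs, the behaviour is probably the same as described here.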
Nagios XI 2024R2.2.1 (8 Servers)
Nagios Fusion 2024R1.0.2
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: MRTG consumes 100% of system resources

Post by scottwilkerson »

There was a bug introduced in the Switch Wizard and it should be updated
Admin -> Manage Config Wizards -> Check for Updates -> Install updates

Also, running the following commands from the command line will fix a permissions problem that was introduced in this version

Code:

chown apache:nagios /etc/mrtg -R
chmod 775 /etc/mrtg -R
chown apache:nagios /var/lib/mrtg -R
chmod 775 /var/lib/mrtg -R
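
To check whether the commands above took effect, something like the following sketch could be used. It assumes GNU find; the function names wrong_mode and wrong_owner are hypothetical and not part of Nagios XI.

```shell
# Print every entry under a directory whose mode is not exactly
# the expected octal mode (for example 775).
wrong_mode() {
    find "$1" ! -perm "$2" -print
}

# Print every entry under a directory that is not owned by the
# expected user and group.
wrong_owner() {
    find "$1" \( ! -user "$2" -o ! -group "$3" \) -print
}

# Example usage with the values from the commands above:
#   wrong_mode /etc/mrtg 775
#   wrong_owner /etc/mrtg apache nagios
#   wrong_mode /var/lib/mrtg 775
#   wrong_owner /var/lib/mrtg apache nagios
```

No output from any of the four calls would suggest that ownership and mode are as intended.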
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart

Re: MRTG consumes 100% of system resources

Post by TBT »

scottwilkerson wrote:There was a bug introduced in the Switch Wizard and it should be updated
Admin -> Manage Config Wizards -> Check for Updates -> Install updates

Also, running the following commands from the command line will fix a permissions problem that was introduced in this version

Code:

chown apache:nagios /etc/mrtg -R
chmod 775 /etc/mrtg -R
chown apache:nagios /var/lib/mrtg -R
chmod 775 /var/lib/mrtg -R

As mentioned previously, we've done this.

Edit: Network Switch / Router wizard is already at v2.4.1

Re: MRTG consumes 100% of system resources

Post by scottwilkerson »

TBT wrote:As mentioned previously, we've done this.
Sorry, I read too fast.

Can you run the following and see if you get any errors?

Code:

 LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios

Re: MRTG consumes 100% of system resources

Post by TBT »

scottwilkerson wrote:
TBT wrote:As mentioned previously, we've done this.
Sorry, I read too fast.

Can you run the following and see if you get any errors?

Code:

 LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios
1. That appears to be the same line that is in the cron job, which we've run manually as well. It hangs, reproducing the issue.

2. We've also run it with the debug option, resulting in the following:
2018-12-05 10:15:16 -- --fork: Child 0 (31223) waiting to deliver
2018-12-05 10:15:16 -- --fork: Parent reading child 0

3. We also noticed that the /var/lib/mrtg/mrtg.ok file isn't being recreated after we manually removed it.
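
Since the manual run hangs, it can help to bound it with a time limit so that the test itself does not leave more stuck processes behind. The following is only a sketch: it assumes the GNU coreutils timeout command is installed, and run_with_limit is a hypothetical helper name.

```shell
# Run a command but give up after the given number of seconds.
# GNU timeout exits with status 124 when the limit is reached,
# otherwise it passes through the command's own exit status.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
}

# Example usage with the command from this thread (120 is an arbitrary limit):
#   run_with_limit 120 env LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg \
#       --lock-file /var/lib/mrtg/mrtg.lock \
#       --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios
#   echo "exit status: $?"
```

An exit status of 124 would indicate that the run hit the limit, which is consistent with the hang described above.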

Re: MRTG consumes 100% of system resources

Post by scottwilkerson »

What are the permissions on this directory?

Code:

ls -ld /var/lib/mrtg

Re: MRTG consumes 100% of system resources

Post by TBT »

scottwilkerson wrote:What are the permissions on this directory?

Code:

ls -ld /var/lib/mrtg
drwxrwxr-x. 2 apache nagios 86016 Dec 5 10:45 /var/lib/mrtg

Re: MRTG consumes 100% of system resources

Post by scottwilkerson »

This looks correct, and I cannot replicate the issue.

Can you run the mrtg command without the user/group to see if you get the same result? The addition of user/group is what changed in 5.5.7.

Code:

LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok

Re: MRTG consumes 100% of system resources

Post by TBT »

Manually running without User and Group was successful. The timestamps on the files in /var/lib/mrtg now reflect when it was run. Also, the mrtg.lock file is present.

Additionally, we've modified the cron job, removing User and Group, and allowed it to run on schedule. That was also successful, as the graphs are updating.

We still don't understand why this affects only 1 of the 9 XI servers in our environment. Should we modify the cron on all servers, and will the User/Group be removed from future XI releases?
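
For reference, the cron change described in this post could be previewed with a small filter like the one below before editing anything. This is a sketch only: strip_user_group is a hypothetical helper, and the cron file location in the usage comment (/etc/cron.d/mrtg) is an assumption that should be checked on each server.

```shell
# Read cron lines on stdin and print them with the
# --user=... and --group=... options removed.
strip_user_group() {
    sed -e 's/[[:space:]]\{1,\}--user=[^[:space:]]*//' \
        -e 's/[[:space:]]\{1,\}--group=[^[:space:]]*//'
}

# Example usage (preview only, the file itself is not modified):
#   strip_user_group < /etc/cron.d/mrtg
```

Because the filter only prints the result, the current line and the proposed one can be compared before the cron file is changed by hand.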

Re: MRTG consumes 100% of system resources

Post by scottwilkerson »

TBT wrote:Manually running without User and Group was successful. The timestamps on the files in /var/lib/mrtg now reflect when it was run. Also, the mrtg.lock file is present.

Additionally, we've modified the cron job, removing User and Group, and allowed it to run on schedule. That was also successful, as the graphs are updating.

We still don't understand why this affects only 1 of the 9 XI servers in our environment. Should we modify the cron on all servers, and will the User/Group be removed from future XI releases?
Glad to hear that removing that resolved the issue, but frankly I don't know why it did. The user/group was added to the cron job to address a security vulnerability, although upgrading the wizard to the latest version may also mitigate that for future runs.

We will not be removing the user/group in the future. If the wizard is updated on all servers, I would say it is OK to change the cron on all of them.