MRTG consumes 100% of system resources
It appears that when cron runs MRTG, the process hangs and re-spawns, eventually consuming 100% of system resources. This started after an upgrade to XI 5.5.7 and, oddly enough, affects only 1 of our 9 XI servers. The timestamps on the rrd files in /var/lib/mrtg are not updating. We also checked file permissions and ownership on /etc/mrtg and /var/lib/mrtg (mentioned in another thread). No errors are present in /var/log/messages.
CentOS 6.10
rrdtool-1.3.8-7.el6.x86_64
glib2-2.28.8-10.el6.x86_64
Any insight?
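The two symptoms described above (MRTG processes piling up and rrd files going stale) can be checked with a short sketch like the one below. The helper names and the 10-minute default are illustrative assumptions, not part of XI or MRTG.

```shell
#!/bin/sh
# Sketch only: helper names and the default threshold are assumptions.

# List running processes whose command line mentions mrtg, with elapsed
# time and CPU usage. The [m] bracket keeps grep from matching itself.
list_mrtg_procs() {
    ps -eo pid,etime,pcpu,args | grep '[m]rtg'
}

# Print .rrd files under DIR that were last modified more than MINUTES
# minutes ago (default: 10).
stale_rrds() {
    dir=$1
    minutes=${2:-10}
    find "$dir" -type f -name '*.rrd' -mmin +"$minutes"
}
```

For example, `list_mrtg_procs` followed by `stale_rrds /var/lib/mrtg 10` shows whether several MRTG runs are stacked up and which rrd files have missed recent polling cycles.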
Nagios XI 2024R2.2.1 (8 Servers)
Nagios Fusion 2024R1.0.2
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: MRTG consumes 100% of system resources
There was a bug introduced in the Switch Wizard and it should be updated
Admin -> Manage Config Wizards -> Check for Updates -> Install updates
Also, running the following commands from the command line will fix a permissions problem that was introduced in this version
Code:
chown apache:nagios /etc/mrtg -R
chmod 775 /etc/mrtg -R
chown apache:nagios /var/lib/mrtg -R
chmod 775 /var/lib/mrtg -R
Re: MRTG consumes 100% of system resources
scottwilkerson wrote: There was a bug introduced in the Switch Wizard and it should be updated
Admin -> Manage Config Wizards -> Check for Updates -> Install updates
Also, running the following commands from the command line will fix a permissions problem that was introduced in this version
Code:
chown apache:nagios /etc/mrtg -R
chmod 775 /etc/mrtg -R
chown apache:nagios /var/lib/mrtg -R
chmod 775 /var/lib/mrtg -R
As mentioned previously, we've done this.
Edit: Network Switch / Router wizard is already at v2.4.1
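To double-check that the recursive chown/chmod actually took effect on every file, something like the following sketch may help. `check_mrtg_perms` is a hypothetical helper, not an XI tool; it prints only entries that deviate from the expected owner, group, or mode.

```shell
#!/bin/sh
# Sketch only: check_mrtg_perms is a hypothetical helper, not an XI tool.

# Print every entry under DIR whose owner, group, or exact permission bits
# differ from the expected values. No output means everything matches.
check_mrtg_perms() {
    dir=$1
    owner=$2
    group=$3
    mode=$4
    find "$dir" \( ! -user "$owner" -o ! -group "$group" -o ! -perm "$mode" \) -print
}
```

Running `check_mrtg_perms /etc/mrtg apache nagios 775` and `check_mrtg_perms /var/lib/mrtg apache nagios 775` should print nothing if the commands above were applied cleanly.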
scottwilkerson
Re: MRTG consumes 100% of system resources
TBT wrote: As mentioned previously, we've done this.
Sorry, I read too fast.
Can you run the following and see if you get any errors?
Code:
LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios
Re: MRTG consumes 100% of system resources
scottwilkerson wrote: Can you run the following and see if you get any errors?
Code:
LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios
1. That appears to be the same command line used in the cron job, which we've also run manually. It hangs, reproducing the issue.
2. We've also run it with the debug option, resulting in the following:
2018-12-05 10:15:16 -- --fork: Child 0 (31223) waiting to deliver
2018-12-05 10:15:16 -- --fork: Parent reading child 0
3. We also noticed that the /var/lib/mrtg/mrtg.ok file isn't being recreated after we manually removed it.
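Because the manual run hangs, it can be convenient to wrap test runs in a time limit so they do not linger between attempts. The sketch below uses coreutils `timeout`; the wrapper name `run_limited` and the 60-second limit in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: run_limited is a hypothetical wrapper around coreutils timeout.

# Run a command with a time limit. Returns the command's own exit status,
# or 124 (timeout's convention) if the limit was reached.
run_limited() {
    limit=$1
    shift
    timeout "$limit" "$@"
}

# Usage (60-second limit is an arbitrary choice):
#   run_limited 60 env LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg \
#       --lock-file /var/lib/mrtg/mrtg.lock \
#       --confcache-file /var/lib/mrtg/mrtg.ok --user=nagios --group=nagios
#   echo "exit status: $?"
```

An exit status of 124 indicates the run was cut off by the limit rather than finishing on its own.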
scottwilkerson
Re: MRTG consumes 100% of system resources
What are the permissions on this directory?
Code:
ls -ld /var/lib/mrtg
Re: MRTG consumes 100% of system resources
scottwilkerson wrote: What are the permissions on this directory?
Code:
ls -ld /var/lib/mrtg
drwxrwxr-x. 2 apache nagios 86016 Dec 5 10:45 /var/lib/mrtg
scottwilkerson
Re: MRTG consumes 100% of system resources
This looks correct, and I cannot replicate the issue.
Can you run the mrtg command without the user/group options to see if you get the same result? (The addition of user/group is what changed in 5.5.7.)
Code:
LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --lock-file /var/lib/mrtg/mrtg.lock --confcache-file /var/lib/mrtg/mrtg.ok
Re: MRTG consumes 100% of system resources
Manually running without User and Group was successful. The timestamps on the files in /var/lib/mrtg now reflect when it was run. Also, the mrtg.lock file is present.
Additionally, we've modified the cron job, removing User and Group, and let it run on schedule. That was also successful, as the graphs are updating.
We still don't understand why this affects only 1 of the 9 XI servers in our environment. Should we modify the cron on all servers, and will the User/Group be removed from future XI releases?
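For anyone making the same cron change, a minimal sketch of stripping the two options is shown below. `strip_mrtg_user_group` is a hypothetical helper, and the cron file path in the usage comment is an assumption about where the MRTG job lives on a given install. It writes the result to standard output rather than editing in place, so the change can be reviewed first; note that it changes which account MRTG runs as.

```shell
#!/bin/sh
# Sketch only: strip_mrtg_user_group is a hypothetical helper, and the cron
# file path in the usage comment is an assumption about the install.

# Print FILE with the --user=... and --group=... options removed from each
# line. The original file is not modified.
strip_mrtg_user_group() {
    sed -e 's/[[:space:]]\{1,\}--user=[^[:space:]]*//' \
        -e 's/[[:space:]]\{1,\}--group=[^[:space:]]*//' \
        "$1"
}

# Usage (path assumed; review the output before installing it):
#   strip_mrtg_user_group /etc/cron.d/mrtg > /tmp/mrtg.cron.new
```

Writing the new version outside the cron directory avoids leaving a second copy of the job where cron might pick it up.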
scottwilkerson
Re: MRTG consumes 100% of system resources
TBT wrote: Manually running without User and Group was successful. Timestamp on the files (/var/lib/mrtg) now reflects when ran. Also, the mrtg.lock file is present.
Additionally, we've modified the cron job, removing User and Group, allowing it to run as per schedule. Result was also successful as graphs are updating.
We still don't understand why this affects only 1 of the 9 XI Servers in our environment. Should we modify the cron on all servers and will the User/Group be removed from future XI releases?
Glad to hear that removing that resolved the issue, but frankly I don't know why it did. The user/group was added to the cron job to fix a security vulnerability, although upgrading the wizard to the latest version may also mitigate that for future runs.
We will not be removing the user/group in the future. If the wizard is updated on all servers, I would say it is OK to change the cron on all of them.