Well, I don't see anything glaringly wrong there, but we could certainly make a few improvements for you. Let's start by timing the current run of mrtg; please send me the full output.
time LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg 2>&1 | tee -a /tmp/mrtg.log
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
It's entirely possible that some mrtg additions were not removed with service removal. I would highly suggest removing those configs from /etc/mrtg/conf.d/.
Also add "Forks: 4" to your main mrtg.cfg. This lets mrtg split its processing across 4 forks so it runs faster. It will add a bit of load, but that should be hardly noticeable, and it should allow mrtg to finish within its 5-minute window instead of the 12 minutes it currently takes.
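To confirm the setting is actually in place before the next run, something like the following could be used. This is only a rough sketch: `show_forks` is a hypothetical helper name, and it assumes the directive sits at the start of a line in the file you pass in.

```shell
# Hypothetical helper: print the Forks line from an mrtg config file,
# or say that none was found. The file path is supplied by the caller.
show_forks() {
    grep -i '^Forks:' "$1" || echo "Forks not set in $1"
}
```

For example, `show_forks /etc/mrtg/mrtg.cfg` should print the "Forks: 4" line once it has been added.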
Thanks, Spenser. I've added the Forks config and will clean up the ghost services again tomorrow and let you know how I get on.
OK. So I added the Forks config last night and my BW checks started reporting proper values and seem to be stable, but the graphs were still not displaying. I cleaned out the redundant mrtg entries this morning and that made no difference. I disabled rrdcached and they started working :-/
I'm still getting NPCD timeouts in the log, though, so I'm not sure everything is fixed yet.
It was set to '80' and I have increased it to '100'
Is there any way to find out why NPCD is timing out? Surely it shouldn't be doing that, right?
Too good to last. The number of files in the perfdata spool hit 1000 whilst I was in a meeting and all graphing has stopped again.
Argh - I don't think NPCD is processing any perfdata successfully anymore. The number of files in the spool just keeps going up. It's still complaining about load thresholds and timeouts, but I have increased both of them significantly (again). I really don't understand what's choking this up. I've got 311 hosts and 5985 services on a box with 8 vCPUs and 16GB of RAM - is this too much? I really need to get this working.
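For keeping an eye on whether the spool is draining or growing, a small helper along these lines may be handy. It is a sketch only: `count_spool_files` is a made-up name, and the spool path in the usage note is an assumption about a typical install, so substitute whatever directory your npcd is configured to read.

```shell
# Hypothetical helper: count regular files directly inside a spool directory.
# The directory is passed in by the caller; nothing is hard-coded here.
count_spool_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}
```

Running `count_spool_files /usr/local/nagios/var/spool/perfdata` (assumed path) a few minutes apart should show whether the backlog is shrinking or still climbing.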
It sounds like rrdcached was either implemented incorrectly or its install wasn't quite finished. That is the cause of most rrdcached issues.
Are you getting timeouts again, or are load thresholds being met? Can you send one of us a profile.zip so we can get some more diagnostics on what might be causing the (likely) additional load or I/O wait?
I'm not sure how I could have incorrectly implemented or not finished the install for rrdcached. I downloaded and ran your install script which completed without any errors. What else do I have to do to get it working? The documentation doesn't list any other steps.
I am getting both timeouts *and* load thresholds in the npcd log. It seems like a bunch of the npcd threads lock up at the same time before timing out then the process repeats.
Spenser is out for the day, but will be back tomorrow. You may want to check your nagios.cfg for the rrdcached line and make sure it is commented.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.