Halp! All My Graphs have Stopped!

sreinhardt · Post by **sreinhardt** » Mon Jul 07, 2014 11:01 am

Well, I don't see anything glaringly wrong there, but we could certainly make a few improvements for you. Let's start by timing the current run of mrtg; please send me the full output.

Code:

time LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg 2>&1 | tee -a /tmp/mrtg.log

BenGatewood · Post by **BenGatewood** » Mon Jul 07, 2014 11:04 am

OK. I'll do that now. Also, I just spotted a bunch of these in the nagios.log:

[1404748050] wproc: 'Core Worker 7178' seems to be choked. ret = -1; bufsize = 5162: errno = 11 (Resource temporarily unavailable)

BenGatewood · Post by **BenGatewood** » Mon Jul 07, 2014 11:48 am

real 12m27.686s
user 0m27.917s
sys 0m1.643s

I also got many:

ERROR: Target[x.x.x.x_x][_OUT_] ' $target->[10185]{$mode} ' did not eval into defined data

But they seem to be for services that XI claims don't exist :-/
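To see which targets those errors belong to, a rough sketch like the one below could tally them per target. It assumes the run was captured to /tmp/mrtg.log as in the timing command above and that every error line looks like the one quoted here; the function name `summarize_mrtg_errors` is made up for illustration.

```shell
#!/bin/sh
# Count "did not eval into defined data" errors per MRTG target in a log file.
# The target name is taken from the first Target[...] bracket on each line.
summarize_mrtg_errors() {
    grep 'did not eval into defined data' "$1" |
        sed -n 's/.*Target\[\([^]]*\)\].*/\1/p' |
        sort | uniq -c | sort -rn
}

# Example (path is an assumption based on the tee command above):
#   summarize_mrtg_errors /tmp/mrtg.log
```

Each output line is a count followed by a target name, which should make it easier to match the noisy targets against leftover config entries.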

sreinhardt · Post by **sreinhardt** » Mon Jul 07, 2014 2:58 pm

It's entirely possible that some mrtg additions were not removed with service removal. I would highly suggest removing those configs from /etc/mrtg/conf.d/.
Also add "Forks: 4" to your main mrtg.cfg. This lets mrtg split its work across 4 forked processes so it gets through it faster. It will add a bit of load, but that should be hardly noticeable, and it should let mrtg finish within its 5-minute window instead of the current 12 minutes.
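As a hedged sketch of that change, the helper below appends the setting only if no Forks line is present yet. The function name `ensure_mrtg_forks` is hypothetical, and appending to the end of the file is an assumption; you may prefer to place the line by hand next to the other global options. Back up the file first.

```shell
#!/bin/sh
# Append "Forks: N" to an MRTG config unless a Forks setting already exists.
# $1 = path to the config file, $2 = number of forks (defaults to 4).
ensure_mrtg_forks() {
    cfg=$1
    forks=${2:-4}
    if grep -Eqi '^[[:space:]]*Forks[[:space:]]*:' "$cfg"; then
        echo "Forks already set in $cfg"
    else
        printf 'Forks: %s\n' "$forks" >> "$cfg"
        echo "Added Forks: $forks to $cfg"
    fi
}

# Example, using the config path from earlier in the thread:
#   cp -p /etc/mrtg/mrtg.cfg /etc/mrtg/mrtg.cfg.bak
#   ensure_mrtg_forks /etc/mrtg/mrtg.cfg 4
```

Running it twice is harmless, since the second call only reports that the setting is already there.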

BenGatewood · Post by **BenGatewood** » Mon Jul 07, 2014 4:21 pm

Thanks, Spenser. I've added the Forks config and will clean up the ghost services again tomorrow and let you know how I get on.

OK. So I added the Forks config last night and my BW checks started reporting proper values and seem to be stable but the graphs were still not displaying. I cleaned out the redundant mrtg entries this morning and that made no difference. I disabled rrdcache and they started working :-/

I'm still getting NPCD timeouts in the log, though, so I'm not sure everything is fixed yet.

lmiltchev · Post by **lmiltchev** » Tue Jul 08, 2014 8:05 am

I believe you haven't modified the "default" timeout value in the "process_perfdata.cfg". To view the current value, run:

Code:

grep TIMEOUT /usr/local/nagios/etc/pnp/process_perfdata.cfg

Open "process_perfdata.cfg" in a text editor, increase the timeout value, save, exit, and restart npcd.

Code:

service npcd restart

See if this is going to fix your problem.
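If you would rather script the edit than use a text editor, something along these lines might work. It is only a sketch: it assumes the setting is written as `TIMEOUT = <number>` on its own line, and the function name `set_npcd_timeout` is made up for illustration. It keeps a `.bak` copy of the original next to the file.

```shell
#!/bin/sh
# Rewrite the numeric TIMEOUT value in a process_perfdata.cfg-style file.
# $1 = path to the config file, $2 = new timeout value.
# A copy of the original is kept as "$1.bak"; the resulting TIMEOUT line is printed.
set_npcd_timeout() {
    cfg=$1
    new=$2
    cp -p "$cfg" "$cfg.bak"
    sed "s/^\([[:space:]]*TIMEOUT[[:space:]]*=[[:space:]]*\)[0-9][0-9]*/\1$new/" \
        "$cfg.bak" > "$cfg"
    grep TIMEOUT "$cfg"
}

# Example, using the path from the grep command above:
#   set_npcd_timeout /usr/local/nagios/etc/pnp/process_perfdata.cfg 100
#   service npcd restart
```

Commented-out lines are left untouched because the pattern only matches lines that start with the TIMEOUT key.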

BenGatewood · Post by **BenGatewood** » Tue Jul 08, 2014 8:35 am

It was set to '80' and I have increased it to '100'.

Is there any way to find out why NPCD is timing out? Surely it shouldn't be doing that, right?

Too good to last. The number of files in the perfdata spool hit 1000 whilst I was in a meeting and all graphing has stopped again.

Argh - I don't think NPCD is processing any perfdata successfully anymore. The number of files in the spool just keeps going up. It is still complaining about load thresholds and timeouts, but I have increased both of them significantly (again). I really don't understand what's choking this up. I've got 311 hosts and 5985 services on a box with 8 vCPUs and 16GB of RAM - is this too much? I really need to get this working.
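For keeping an eye on the backlog, a small sketch like this can count the files waiting in a spool directory. The function name `count_spool_files` is hypothetical, and the directory in the example is an assumption about where the perfdata spool lives on this install, so substitute your own path.

```shell
#!/bin/sh
# Print the number of regular files under a spool directory.
# $1 = spool directory to inspect.
count_spool_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example (directory is an assumption, adjust to your install):
#   count_spool_files /usr/local/nagios/var/spool/perfdata
```

Running it every minute or so shows whether the count is draining or still climbing after a config change.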

sreinhardt · Post by **sreinhardt** » Tue Jul 08, 2014 1:48 pm

BenGatewood wrote: I disabled rrdcache and they started working :-/

It sounds like rrdcached was either implemented incorrectly or its install wasn't quite finished. That is the cause of most rrdcached issues.

Are you getting timeouts again, or are load thresholds being met? Can you send one of us a profile.zip so we can get some more diagnostics on what might be causing the (likely) additional load or I/O wait?

BenGatewood · Post by **BenGatewood** » Tue Jul 08, 2014 5:00 pm

I'm not sure how I could have incorrectly implemented or not finished the install for rrdcached. I downloaded and ran your install script which completed without any errors. What else do I have to do to get it working? The documentation doesn't list any other steps.

I am getting both timeouts *and* load thresholds in the npcd log. It seems like a bunch of the npcd threads lock up at the same time before timing out then the process repeats.

I'll PM you the profile.zip now.

abrist · Post by **abrist** » Tue Jul 08, 2014 5:24 pm

Spenser is out for the day, but will be back tomorrow. You may want to check your nagios.cfg for the rrdcached line, make sure it is commented.
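A quick way to do that check might look like the sketch below, which lists every line mentioning rrdcached and says whether it is commented. The function name `check_rrdcached_lines` is made up for illustration, and the nagios.cfg path in the example is an assumption about a default install.

```shell
#!/bin/sh
# List lines mentioning "rrdcached" in a config file and flag each one as
# "commented" (first non-blank character is #) or "active".
# $1 = path to the config file.
check_rrdcached_lines() {
    awk '/rrdcached/ {
        state = ($0 ~ /^[ \t]*#/) ? "commented" : "active"
        printf "%s:%d: %s\n", state, NR, $0
    }' "$1"
}

# Example (path is an assumption):
#   check_rrdcached_lines /usr/local/nagios/etc/nagios.cfg
```

No output means the file does not mention rrdcached at all.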

Nagios Support Forum
