Halp! All My Graphs have Stopped!

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Halp! All My Graphs have Stopped!

Post by sreinhardt »

Well, I don't see anything glaringly wrong there, but we could certainly make a few improvements for you. Let's start by timing the current mrtg run. Please send me the full output.

Code: Select all

time LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg 2>&1 | tee -a /tmp/mrtg.log
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

OK. I'll do that now. Also, I just spotted a bunch of these in the nagios.log:

[1404748050] wproc: 'Core Worker 7178' seems to be choked. ret = -1; bufsize = 5162: errno = 11 (Resource temporarily unavailable)
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

real 12m27.686s
user 0m27.917s
sys 0m1.643s

I also got many:

ERROR: Target[x.x.x.x_x][_OUT_] ' $target->[10185]{$mode} ' did not eval into defined data

But they seem to be for services that XI claims don't exist :-/
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Halp! All My Graphs have Stopped!

Post by sreinhardt »

It's entirely possible that some mrtg additions were not removed with service removal. I would highly suggest removing those configs from /etc/mrtg/conf.d/.
Also add "Forks: 4" to your main mrtg.cfg. This lets mrtg split into 4 forks and process targets in parallel. It will add a bit of load, but it should be hardly noticeable, and it should let mrtg finish within its 5-minute window instead of the 12 minutes it currently takes.
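To narrow down which files in /etc/mrtg/conf.d/ hold the stale entries, something like the sketch below may help. It assumes the mrtg output was captured to /tmp/mrtg.log as in the earlier command and that each target is defined by a "Target[name]:" line in a .cfg file. The helper names stale_targets and target_files are made up for illustration and are not part of mrtg or Nagios XI.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with mrtg or Nagios XI).

# Print the unique target names that mrtg reported as failing.
# Matches lines of the form: ERROR: Target[name][_OUT_] ' ... '
stale_targets() {
    sed -n 's/.*ERROR: Target\[\([^]]*\)].*/\1/p' "$1" | sort -u
}

# List the .cfg files in a directory that define the given target.
# -F treats the pattern as a fixed string, so the brackets are literal.
target_files() {
    grep -l -F "Target[$2]" "$1"/*.cfg
}

# Example (assumed paths, review the list before deleting anything):
#   for t in $(stale_targets /tmp/mrtg.log); do
#       target_files /etc/mrtg/conf.d "$t"
#   done
```

This only lists candidate files. It is worth checking each one against what XI shows before removing it.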
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

Thanks, Spenser. I've added the Forks config and will clean up the ghost services again tomorrow and let you know how I get on.

OK. So I added the Forks config last night and my BW checks started reporting proper values and seem to be stable but the graphs were still not displaying. I cleaned out the redundant mrtg entries this morning and that made no difference. I disabled rrdcache and they started working :-/

I'm still getting NPCD timeouts in the log, though, so I'm not sure everything is fixed yet.
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Halp! All My Graphs have Stopped!

Post by lmiltchev »

I believe you haven't modified the "default" timeout value in the "process_perfdata.cfg". To view the current value, run:

Code: Select all

grep TIMEOUT /usr/local/nagios/etc/pnp/process_perfdata.cfg
Open "process_perfdata.cfg" in a text editor, increase the timeout value, save, exit, and restart npcd.

Code: Select all

service npcd restart
See if this is going to fix your problem.
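If you would rather script the change than edit by hand, here is a minimal sketch. It assumes the file has a single line of the form "TIMEOUT = <number>". The function name set_pnp_timeout is hypothetical and not part of PNP or Nagios XI.

```shell
#!/bin/sh
# Hypothetical helper: raise TIMEOUT in a process_perfdata.cfg-style file.
# Usage: set_pnp_timeout <config file> <new value in seconds>
set_pnp_timeout() {
    cfg="$1"
    new="$2"
    # Keep a backup copy next to the original before changing anything.
    cp -p "$cfg" "$cfg.bak" || return 1
    # Rewrite from the backup into the original file, so the original
    # keeps its owner and mode. Only the number after "TIMEOUT =" changes.
    sed "s/^\(TIMEOUT[[:space:]]*=[[:space:]]*\)[0-9][0-9]*/\1$new/" "$cfg.bak" > "$cfg"
    # Show the resulting line for confirmation.
    grep '^TIMEOUT' "$cfg"
}

# Example (path taken from the grep command above):
#   set_pnp_timeout /usr/local/nagios/etc/pnp/process_perfdata.cfg 100
#   service npcd restart
```

The backup is left in place as process_perfdata.cfg.bak in case the change needs to be reverted.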
Be sure to check out our Knowledgebase for helpful articles and solutions!
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

It was set to '80' and I have increased it to '100'

Is there any way to find out why NPCD is timing out? Surely it shouldn't be doing that, right?

Too good to last. Number of files in the perfdata spool hit 1000 whilst I was in a meeting and all graphing has stopped again :(

Argh - I don't think NPCD is processing any perfdata successfully anymore. The files in the spool just keep going up. Still complaining about load thresholds and timeouts but I have increased both of them significantly (again). I really don't understand what's choking this up. I've got 311 hosts and 5985 services on a box with 8 vCPUs and 16GB of RAM - is this too much? I really need to get this working.
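For anyone following along, a rough way to watch whether the spool is draining or growing is sketched below. The spool path in the example is an assumption about a typical XI install, so check your own configuration for the real one. The function name spool_count is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count regular files directly inside a directory.
# -maxdepth is supported by GNU and BSD find, which covers typical XI hosts.
spool_count() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example: print a timestamped count every 30 seconds.
# The path below is an assumption, not confirmed for this install.
#   while true; do
#       echo "$(date '+%H:%M:%S') $(spool_count /usr/local/nagios/var/spool/perfdata)"
#       sleep 30
#   done
```

A count that only ever rises suggests npcd is not keeping up with incoming perfdata. A count that rises and falls suggests it is processing, just slowly.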
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Halp! All My Graphs have Stopped!

Post by sreinhardt »

BenGatewood wrote: I disabled rrdcache and they started working :-/

It sounds like rrdcached was either set up incorrectly or the install didn't quite finish. That is the cause of most rrdcached issues.

Are you getting timeouts again, or are the load thresholds being met? Can you send one of us a profile.zip so we can get more diagnostics on what might be causing the (likely) additional load or I/O wait?
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

I'm not sure how I could have incorrectly implemented or not finished the install for rrdcached. I downloaded and ran your install script which completed without any errors. What else do I have to do to get it working? The documentation doesn't list any other steps.

I am getting both timeouts *and* load thresholds in the npcd log. It seems like a bunch of the npcd threads lock up at the same time before timing out then the process repeats.

I'll PM you the profile.zip now.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Halp! All My Graphs have Stopped!

Post by abrist »

Spenser is out for the day, but will be back tomorrow. In the meantime, you may want to check your nagios.cfg for the rrdcached line and make sure it is commented out.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.