NPCD errors

Post by **jsmurphy** » Tue Jan 17, 2012 6:44 pm

Hey guys,

I've been having some trouble after a large host import I did yesterday (about 2000 new hosts). I've worked through most of the issues, but I'm having trouble getting NPCD to process properly. The npcd.log is filled with...

[01-17-2012 23:29:34] NPCD: Processing file 'host-perfdata.1326842967'
[01-17-2012 23:29:34] NPCD: ThreadCounter 1/5 File is service-perfdata.1326842967
[01-17-2012 23:29:34] NPCD: Regular File: service-perfdata.1326842967
[01-17-2012 23:29:34] NPCD: A thread was started on thread_counter = 1
[01-17-2012 23:29:34] NPCD: Have to wait: Filecounter = 2 - thread_counter = 2
[01-17-2012 23:29:34] NPCD: Processing file service-perfdata.1326842967 with ID -1227883664 - going to exec /usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//service-perfdata.1326842967
[01-17-2012 23:29:34] NPCD: Processing file 'service-perfdata.1326842967'
[01-17-2012 23:29:39] NPCD: ERROR: Executed command exits with return code '7'
[01-17-2012 23:29:39] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//host-perfdata.1326842967'
[01-17-2012 23:29:39] NPCD: No more files to process... waiting for 15 seconds

And perfdata.log...

2012-01-17 23:29:39 [23237] [0] *** TIMEOUT: Timeout after 5 secs. ***
2012-01-17 23:29:39 [23237] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2012-01-17 23:29:39 [23237] [0] *** TIMEOUT: Please check your npcd.cfg
2012-01-17 23:29:39 [23237] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//host-perfdata.1326842967-PID-23237 deleted
2012-01-17 23:29:39 [23237] [0] *** Timeout while processing Host: "hostnamehere" Service: "_HOST_"
2012-01-17 23:29:39 [23237] [0] *** process_perfdata.pl terminated on signal ALRM

On an unrelated note, exporting config files via nagql takes one hell of a long time now and I had to increase the timeout in /etc/php.ini... is it necessary to export all of the files instead of just those that aren't synchronized? Could you not create a separate option for full export and have the default as an incremental export of updated entries?

Post by **mguthrie** » Wed Jan 18, 2012 10:39 am

You could try turning down the sleep time in between directory scans for the npcd daemon. Try:

sleep_time=10

in /usr/local/nagios/etc/pnp/npcd.cfg
and then:

service npcd restart

and see if that helps.

Could you not create a separate option for full export and have the default as an incremental export of updated entries?

As we get more and more large installations, this does seem like a good idea. The main time this is problematic is when there's a massive config change or import, like you described above. Does it continue to take a very long time to write out the files even after that initial import is applied?

Post by **jsmurphy** » Wed Jan 18, 2012 5:24 pm

The initial import did take a fair while but that was to be expected... the exports afterwards are the problem, especially when you make a one-line change and it takes just over a minute to do the write. Though I have noticed a bug where changing certain single fields doesn't set the object to out-of-synch (parents I think is one of them).

Having re-read my first post I realized I never actually stated what the problem is with NPCD; it's not processing performance data at all... it's picking up the file and then failing with exit code 7 then removing the file. I've tried restarting npcd and playing with the timeout periods but that doesn't seem to have accomplished much and unfortunately my knowledge of rrd/pnp/etc is a little bit thinner than it is with the rest of the app.

Post by **mguthrie** » Thu Jan 19, 2012 10:20 am

Though I have noticed a bug where changing certain single fields doesn't set the object to out-of-synch (parents I think is one of them).

If you isolate the steps to reproduce this, can you post the bug to our tracker? (tracker.nagios.com).

The reason we do the full write of all config files at once is that it allows XI to create the working check points, so if a bad config gets applied, it can just roll back and always keep the monitoring engine running. As I dive back into the CCM I can certainly look and see if anything can be optimized with the config write.

the problem is with NPCD; it's not processing performance data at all... it's picking up the file and then failing with exit code 7 then removing the file.

We have seen this one before, but we haven't ever been able to isolate the root cause of it. In some cases the problem has been with permissions.
chmod -R +x /usr/local/nagios/share/perfdata

Also, take a look at the /usr/local/nagios/var directory, and make sure everything is owned by nagios.nagios, and that permissions are at least 0664 on all files.

Do you have a massive amount of results in /usr/local/nagios/var/spool/perfdata?

Do you have a massive amount of results in /usr/local/nagios/var/service-perfdata?
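
The checks above can be roughed out as a couple of shell helpers. This is only a sketch: the function names `count_backlog` and `find_bad_perms` are made up for illustration (they are not part of Nagios XI or PNP), and it assumes a `find` that supports `-maxdepth` (GNU and BSD both do).

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Count regular files sitting directly in a directory (non-recursive),
# e.g. perfdata files waiting to be picked up by NPCD.
count_backlog() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# List regular files under a directory that are NOT owned by the given
# user and group, or that are missing any of the 0664 permission bits.
# $1 = directory, $2 = expected user, $3 = expected group
find_bad_perms() {
    find "$1" -type f \( ! -user "$2" -o ! -group "$3" -o ! -perm -0664 \) -print
}
```

Assuming the paths mentioned in this thread, usage would look something like `count_backlog /usr/local/nagios/var/spool/perfdata` and `find_bad_perms /usr/local/nagios/var nagios nagios`. Anything the second command prints is a candidate for a chown/chmod.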

Post by **jsmurphy** » Thu Jan 19, 2012 5:01 pm

Seems to be resolved now. I ran the permissions change and restarted the npcd service again, and it seems to have all come back good... I don't know how that would have happened, since I've never touched the share/perfdata dir. Oh well, it's solved now, thanks!

I'll keep a close eye on the CCM stuff and submit a bug report once I know a little more precisely which fields cause the behaviour.

Could you not still write out changed configs and take a full directory check point so you still have those consistent full state snap-shots? Just an idea at any rate, I foresee trouble in my future when I open up administration to more of our engineers and they each want to make independent changes around the same time

Post by **mguthrie** » Fri Jan 20, 2012 11:29 am

I came across the same issue you experienced on my large-install test box. I took a few more drastic measures that I wouldn't recommend for a production environment, but I did end up coming across the idea of increasing the timeout setting in the /usr/local/nagios/etc/pnp/process_perfdata.cfg file. I think this happens when the perfdata gets backed up, the disk gets really busy, or the CPU load gets pretty high. Glad it's working again though!
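
To see what the timeout is currently set to before changing it, something like the helper below could work. It is a rough sketch: `get_cfg_value` is a made-up name, and it assumes the setting is stored as a plain `KEY = value` line (with the key being `TIMEOUT`), which is worth confirming against your own process_perfdata.cfg.

```shell
#!/bin/sh
# Hypothetical helper; the name is illustrative only.

# Print the value of a "KEY = value" setting from a simple config file.
# Commented-out lines (starting with #) are ignored; if the key appears
# more than once, the last occurrence is printed.
# $1 = key name, $2 = config file
get_cfg_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p" "$2" | tail -n 1
}
```

Usage, assuming the key name above: `get_cfg_value TIMEOUT /usr/local/nagios/etc/pnp/process_perfdata.cfg`. The "Timeout after 5 secs" lines in perfdata.log earlier in the thread suggest which value to expect on an affected box.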

Could you not still write out changed configs and take a full directory check point so you still have those consistent full state snap-shots? Just an idea at any rate, I foresee trouble in my future when I open up administration to more of our engineers and they each want to make independent changes around the same time

I will take a look at the code and what we can do here. If you want to discuss solutions to the issue with a large install with many engineers working on a single XI instance, try sending a private message to neibais. They have about 80 users on a single XI install, with about 25 actively using it at the same time.
