Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 10:48 am
by jbennett
mguthrie wrote: Yes, but also check why it filled up. If you're also using it for the perfdata and checkresult spool, those files can fill up the RAM disk if either the Nagios daemon or the NPCD daemon isn't running.
My service-perfdata is at 53M and host-perfdata is at 16M.

When I check the nagios and npcd services, both show as running (each with a PID).

I just followed the document on how to utilize the ram disk. Is this not correct? What would be causing these two to fill up so much?

Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 10:58 am
by mguthrie
You did the setup correctly, but there's something wrong with performance data processing if those files are that big.

Are your performance graphs up to date?

Are you getting timeout or CPU load errors in /usr/local/nagios/var/perfdata.log or npcd.log?
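
One quick way to answer that is to count the timeout and load warnings in each log. This is a minimal sketch, assuming the log lines look like the PNP messages quoted in this thread. The function name `count_perf_errors` is made up for illustration, and the location of npcd.log may differ on your install.

```shell
#!/bin/sh
# count_perf_errors FILE: print how many lines in FILE mention a timeout
# or a reached load limit. The patterns are assumptions based on the
# messages shown in this thread.
count_perf_errors() {
    # grep -c prints the number of matching lines, -i ignores case,
    # -E enables the "|" alternation. "|| true" keeps a count of zero
    # from being reported as a failure.
    grep -ciE 'timeout|MAX load reached' "$1" || true
}

# Example use (path taken from this thread, adjust as needed):
# count_perf_errors /usr/local/nagios/var/perfdata.log
```

A count that keeps growing between runs suggests the problem is still happening, while a log whose newest entry is weeks old points somewhere else.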

Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 11:07 am
by jbennett
mguthrie wrote: You did the setup correctly, but there's something wrong with performance data processing if those files are that big.

Are your performance graphs up to date?

Are you getting timeout or CPU load errors in /usr/local/nagios/var/perfdata.log or npcd.log?
When I check the host graphs, they don't actually come up. When I click on a specific host graph, I see the following:

You are not authorized to access this feature. Contact your Nagios XI administrator for more information, or to obtain access to this feature.

I am logged in as the system admin as well.

The last entry in perfdata.log is from about a month ago.

Code: Select all

2013-01-22 17:53:05 [7117] [0] *** Timeout while processing Host: "example host" Service: "_HOST_"
2013-01-22 17:53:05 [7117] [0] *** process_perfdata.pl terminated on signal ALRM
2013-01-25 16:23:00 [8392] [0] *** TIMEOUT: Timeout after 5 secs. ***
2013-01-25 16:23:00 [8392] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2013-01-25 16:23:00 [8392] [0] *** TIMEOUT: Please check your npcd.cfg
2013-01-25 16:23:00 [8392] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//host-perfdata.1358703239-PID-8392 deleted
However, npcd.log is much more recent, with the following:

Code: Select all

[02-26-2013 09:43:49] NPCD: npcd Daemon (0.4.14) started with PID=3382
[02-26-2013 09:43:49] NPCD: Please have a look at 'npcd -V' to get license information
[02-26-2013 09:43:49] NPCD: HINT: load_threshold is enabled - ('10.000000')
[02-26-2013 10:06:49] NPCD: WARN: MAX load reached: load 10.050000/10.000000 at i=0
I'm lost as to where to check next, other than to increase the max load limit? What would be causing me to reach that max load in the first place?

Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 11:59 am
by abrist
To fix the timeout and load errors, edit the file:

Code: Select all

/usr/local/nagios/etc/pnp/process_perfdata.cfg
Change:

Code: Select all

TIMEOUT = 5
To:

Code: Select all

TIMEOUT = 10
Also edit this file:

Code: Select all

/usr/local/nagios/etc/pnp/npcd.cfg
Change:

Code: Select all

load_threshold = 10.0
To:

Code: Select all

load_threshold = 20.0
Now restart npcd:

Code: Select all

service npcd stop
killall -9 npcd
service npcd start
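
If you would rather script the two edits than make them by hand, something like the following could work. It is only a sketch: `set_cfg_value` is a hypothetical helper, not part of PNP, and it assumes the settings appear as `KEY = value` at the start of a line, as in the snippets above. It keeps a `.bak` copy of each file before changing it.

```shell
#!/bin/sh
# set_cfg_value FILE KEY VALUE: rewrite every "KEY = ..." line in FILE
# to "KEY = VALUE". The original is saved as FILE.bak first.
# Hypothetical helper for illustration only.
set_cfg_value() {
    file=$1 key=$2 value=$3
    cp -p "$file" "$file.bak" || return 1
    # Match KEY at the start of a line (leading spaces allowed),
    # followed by optional spaces and "=". Other lines pass through.
    sed "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file.bak" > "$file"
}

# Example use (run as root, then restart npcd as shown above):
# set_cfg_value /usr/local/nagios/etc/pnp/process_perfdata.cfg TIMEOUT 10
# set_cfg_value /usr/local/nagios/etc/pnp/npcd.cfg load_threshold 20.0
```

Writing back into the existing file, rather than replacing it, leaves its owner and permissions as they were.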
jbennett wrote: I'm lost as to where to check next, other than to increase the max load limit? What would be causing me to reach that max load in the first place?
The default max load setting is fairly conservative. If you have multiple cores and a good amount of memory, it can be set much higher. If you are running thousands of checks, or CPU-intensive checks, it will need to be set higher.
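
To get a feel for how close the box runs to the limit, you can compare the current load average against the threshold yourself. This is a rough sketch, assuming a Linux host with `/proc/loadavg`. `load_exceeds` is a made-up name, and NPCD's own check may differ in detail.

```shell
#!/bin/sh
# load_exceeds LOAD THRESHOLD: succeed (exit status 0) if LOAD is
# greater than THRESHOLD. awk does the comparison because plain sh
# arithmetic cannot handle decimal numbers.
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load + 0 > limit + 0) }'
}

# Example use on Linux: the 1-minute load average is the first field
# of /proc/loadavg, and nproc (GNU coreutils) prints the core count.
# load=$(cut -d' ' -f1 /proc/loadavg)
# if load_exceeds "$load" 20.0; then
#     echo "load $load is above 20.0 on $(nproc) cores"
# fi
```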
jbennett wrote: When I check the host graphs, they don't actually come up. When I click on a specific host graph, I see the following:

You are not authorized to access this feature. Contact your Nagios XI administrator for more information, or to obtain access to this feature.
It could be a permissions problem, so let's check the perfdata directory's permissions:

Code: Select all

ll /usr/local/nagios/share/perfdata

Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 12:03 pm
by mguthrie
I would start with increasing that load threshold to 20.0, and then let's do a few things to see if we can get things back to normal.

Code: Select all

rm -f /var/nagiosramdisk/host-perfdata
rm -f /var/nagiosramdisk/service-perfdata
service nagios restart
Those files will get regenerated, but that should stop the system from snowballing.
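
To confirm the files stay small after the restart, you could check their sizes every few minutes. A minimal sketch follows. `size_mb` is a hypothetical helper, and the paths in the example are the ones used in this thread.

```shell
#!/bin/sh
# size_mb FILE: print the size of FILE in whole megabytes (rounded
# down), or 0 if the file does not exist.
size_mb() {
    if [ -f "$1" ]; then
        # wc -c counts bytes; 1048576 bytes = 1 MB here.
        echo $(( $(wc -c < "$1") / 1048576 ))
    else
        echo 0
    fi
}

# Example use:
# for f in /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/service-perfdata; do
#     echo "$f: $(size_mb "$f") MB"
# done
```

If the numbers climb back toward the sizes reported earlier, performance data is still not being processed.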

Have you done any manual updates to PNP on your system?

Can you post your nagios.cfg file?

Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 12:13 pm
by jbennett
I have done everything you suggested above. While I was at it, I wanted to repair the DB as a precaution. When I tried to run the following commands, I got an error saying the maximum number of connections had been reached for mysql:

mysql -u ndoutils -pn@gweb nagios -e 'TRUNCATE TABLE nagios_logentries'
mysql -u ndoutils -pn@gweb nagios -e 'TRUNCATE TABLE nagios_notifications'

I didn't have a max connections value set in my.cnf. I have since set it to 150. Is this message a side effect of the other issues you just mentioned, or should I actually raise the max connections?

I have about 1700 hosts and about 4300 service checks on this instance of Nagios.

When I ll the files in the /usr/local/nagios/share/perfdata directory, they all have the following permissions:

Code: Select all

drwxrwxr-x 2 nagios nagios
When I check the permissions on the perfdata directory itself, I see the following:

Code: Select all

drwxrwxr-x 4643 nagios nagios 200704 Jan 18 18:13 perfdata

Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 12:20 pm
by jbennett
mguthrie wrote: I would start with increasing that load threshold to 20.0, and then let's do a few things to see if we can get things back to normal.

Code: Select all

rm -f /var/nagiosramdisk/host-perfdata
rm -f /var/nagiosramdisk/service-perfdata
service nagios restart
Those files will get regenerated, but that should stop the system from snowballing.
I am now down to the following:

tmpfs 75M 15M 61M 20% /var/nagiosramdisk
Have you done any manual updates to PNP on your system?
None that I am aware of, however, I do not have direct control over this VM as it is hosted in our data center. If this is something I need to check on, I will have to ask those guys.
Can you post your nagios.cfg file?
Uploaded as an attachment.

Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 1:08 pm
by mguthrie
My guess is that the CPU load caused the performance data to get backed up, which in turn caused the RAM disk to fill up. You might be OK with just the higher load threshold. Are you starting to get performance data on your graphs again?

Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 1:11 pm
by jbennett
Unfortunately, I am not. I just get the red X.

Re: Tactical Overview, Ops Center, Ops Screen Problems

Posted: Wed Feb 27, 2013 2:13 pm
by mguthrie
Hmm, that's likely something separate from the other issue. Do you have any errors or notices showing up in the apache log?