Tactical Overview, Ops Center, Ops Screen Problems

jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by jbennett »

mguthrie wrote: Yes, but also check why it filled up. If you're also using it for the perfdata and checkresult spool, those files can fill up the RAM disk if either the Nagios daemon or the NPCD daemon isn't running.
My service-perfdata is at 53M and host-perfdata is at 16M.

When I check the nagios and npcd services, both show as running (each with a PID).

I just followed the document on how to use the RAM disk. Is this not correct? What would be causing these two files to fill up so much?
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by mguthrie »

You did the setup correctly, but there's something wrong with performance data processing if those files are that big.

Are your performance graphs up to date?

Are you getting timeout or CPU load errors in /usr/local/nagios/var/perfdata.log or npcd.log?
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by jbennett »

mguthrie wrote:You did the setup correctly, but there's something wrong with performance data processing if those files are that big.

Are your performance graphs up to date?

Are you getting timeout or CPU load errors in /usr/local/nagios/var/perfdata.log or npcd.log?
When I check the host graphs, they don't actually come up. When I click on a specific host graph, I see the following:

You are not authorized to access this feature. Contact your Nagios XI administrator for more information, or to obtain access to this feature.

I am logged in as the system admin as well.

The last entry in perfdata.log is from about a month ago.

Code: Select all

2013-01-22 17:53:05 [7117] [0] *** Timeout while processing Host: "example host" Service: "_HOST_"
2013-01-22 17:53:05 [7117] [0] *** process_perfdata.pl terminated on signal ALRM
2013-01-25 16:23:00 [8392] [0] *** TIMEOUT: Timeout after 5 secs. ***
2013-01-25 16:23:00 [8392] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2013-01-25 16:23:00 [8392] [0] *** TIMEOUT: Please check your npcd.cfg
2013-01-25 16:23:00 [8392] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//host-perfdata.1358703239-PID-8392 deleted
However, npcd.log is much more recent, with the following:

Code: Select all

[02-26-2013 09:43:49] NPCD: npcd Daemon (0.4.14) started with PID=3382
[02-26-2013 09:43:49] NPCD: Please have a look at 'npcd -V' to get license information
[02-26-2013 09:43:49] NPCD: HINT: load_threshold is enabled - ('10.000000')
[02-26-2013 10:06:49] NPCD: WARN: MAX load reached: load 10.050000/10.000000 at i=0
I'm lost as to where to check next, other than to increase the max load limit? What would be causing me to reach that max load in the first place?
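One way to see how often these events occur is to count the matching lines in each log. The following is only a sketch: the function names are made up for illustration, and the match strings are taken from the log excerpts above.

```shell
#!/bin/sh
# Count NPCD load-threshold warnings in an npcd.log-style file.
# Matches the "MAX load reached" text seen in the npcd.log excerpt.
count_load_warnings() {
    # grep -c prints the number of matching lines (0 if none).
    # "|| true" hides grep's non-zero exit status when nothing matches.
    grep -c 'MAX load reached' "$1" || true
}

# Count process_perfdata.pl timeouts in a perfdata.log-style file.
# Matches the "Timeout while processing" text seen in the perfdata.log excerpt.
count_perfdata_timeouts() {
    grep -c 'Timeout while processing' "$1" || true
}
```

For example, `count_perfdata_timeouts /usr/local/nagios/var/perfdata.log` prints the number of timeouts logged so far. The same can be done for npcd.log, assuming it sits in the same directory. A count that keeps growing would suggest the backlog is not clearing.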
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by abrist »

To fix the timeout and load errors, edit the file:

Code: Select all

/usr/local/nagios/etc/pnp/process_perfdata.cfg
Change:

Code: Select all

TIMEOUT = 5
To:

Code: Select all

TIMEOUT = 10
Also edit this file:

Code: Select all

/usr/local/nagios/etc/pnp/npcd.cfg
Change:

Code: Select all

load_threshold = 10.0
To:

Code: Select all

load_threshold = 20.0
Now restart npcd:

Code: Select all

service npcd stop
killall -9 npcd
service npcd start
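If you would rather script the two config changes than edit the files by hand, something like the following might work. It is a rough sketch: `set_cfg_value` is a hypothetical helper, not part of Nagios XI or PNP, and it assumes each setting already exists in the file as a `KEY = value` line.

```shell
#!/bin/sh
# set_cfg_value FILE KEY VALUE  (hypothetical helper)
# Rewrites an existing "KEY = old" line to "KEY = VALUE".
# The result is written to a temporary file first, then copied over the
# original with cat so the original's owner and permissions are kept.
set_cfg_value() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    sed "s|^[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Usage would look like `set_cfg_value /usr/local/nagios/etc/pnp/process_perfdata.cfg TIMEOUT 10` and `set_cfg_value /usr/local/nagios/etc/pnp/npcd.cfg load_threshold 20.0`. Taking a copy of each file beforehand (for example with `cp -p`) is a sensible precaution.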
jbennett wrote: I'm lost as to where to check next, other than to increase the max load limit? What would be causing me to reach that max load in the first place?
The default max load setting is fairly conservative. If you have multiple cores and a good amount of memory, it can be set much higher. If you are running thousands of checks, or CPU-intensive checks, it will need to be set higher.
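To get a feel for how close the box runs to a given threshold, the current load can be compared against it. This is a sketch only: both function names are invented for illustration, and using the 1-minute load average from /proc/loadavg (Linux only) is an assumption about which figure is the relevant one.

```shell
#!/bin/sh
# load_exceeds LOAD THRESHOLD  (hypothetical helper)
# Exits 0 when LOAD is strictly greater than THRESHOLD, 1 otherwise.
# awk does the comparison because POSIX shell arithmetic is integer-only.
load_exceeds() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 > max + 0) }'
}

# current_load  (hypothetical helper)
# Prints the 1-minute load average, the first field of /proc/loadavg.
current_load() {
    cut -d ' ' -f 1 /proc/loadavg
}
```

For instance, `load_exceeds "$(current_load)" 20.0 && echo "over threshold"`. Running `nproc` alongside it shows how many cores that load is spread across.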
jbennett wrote: When I check the host graphs, they don't actually come up. When I click on a specific host graph, I see the following:

You are not authorized to access this feature. Contact your Nagios XI administrator for more information, or to obtain access to this feature.
It could be a permissions problem, so let's check the perfdata directory's permissions:

Code: Select all

ls -l /usr/local/nagios/share/perfdata
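With thousands of host directories, reading a long listing by eye is tedious. A quicker check is to list only the entries that are not owned by the expected user. This is a sketch; `list_not_owned_by` is a made-up name, and `nagios` as the expected owner is an assumption to verify on your system.

```shell
#!/bin/sh
# list_not_owned_by DIR USER  (hypothetical helper)
# Prints every file or directory under DIR whose owner is not USER.
# No output means ownership is consistent throughout the tree.
list_not_owned_by() {
    find "$1" ! -user "$2" -print
}
```

Running `list_not_owned_by /usr/local/nagios/share/perfdata nagios` should print nothing if ownership is uniform.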
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by mguthrie »

I would start by increasing that load threshold to 20.0, and then let's do a few things to see if we can get things back to normal.

Code: Select all

rm -f /var/nagiosramdisk/host-perfdata
rm -f /var/nagiosramdisk/service-perfdata
service nagios restart
Those files will get regenerated, but that should stop the system from snowballing.

Have you done any manual updates to PNP on your system?

Can you post your nagios.cfg file?
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by jbennett »

I have done everything you suggested above. In the process, I wanted to repair the DB as a precaution. When I tried to run the following commands, I got an error saying the maximum number of MySQL connections had been reached:

mysql -u ndoutils -pn@gweb nagios -e 'TRUNCATE TABLE nagios_logentries'
mysql -u ndoutils -pn@gweb nagios -e 'TRUNCATE TABLE nagios_notifications'

I didn't have a max connections value set in my.cnf. I have since raised it to 150. I'm wondering whether this message is due to the other issues you just mentioned, or whether I actually should raise the max connections.

I have about 1700 hosts and about 4300 service checks on this instance of Nagios.

When I list the files in the /usr/local/nagios/share/perfdata directory, they all have the following permissions:

Code: Select all

drwxrwxr-x 2 nagios nagios
When I check the permissions on the perfdata directory itself, I have the following:

Code: Select all

drwxrwxr-x 4643 nagios nagios 200704 Jan 18 18:13 perfdata
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by jbennett »

mguthrie wrote: I would start by increasing that load threshold to 20.0, and then let's do a few things to see if we can get things back to normal.

Code: Select all

rm -f /var/nagiosramdisk/host-perfdata
rm -f /var/nagiosramdisk/service-perfdata
service nagios restart
Those files will get regenerated, but that should stop the system from snowballing.
I am now down to the following:

tmpfs 75M 15M 61M 20% /var/nagiosramdisk
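To keep an eye on this figure without reading `df` by hand, the usage percentage can be pulled out of the output. This is a sketch under assumptions: the function names are invented for illustration, and the parsing relies on the POSIX `df -P` column layout, where the fifth column is the capacity.

```shell
#!/bin/sh
# usage_percent  (hypothetical helper)
# Reads `df -P` output on stdin and prints the capacity of the first
# filesystem line as a bare number (the trailing % is stripped).
usage_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# ramdisk_over_limit MOUNTPOINT LIMIT  (hypothetical helper)
# Exits 0 when the filesystem at MOUNTPOINT is at or above LIMIT percent.
ramdisk_over_limit() {
    used=$(df -P "$1" | usage_percent)
    [ "$used" -ge "$2" ]
}
```

For example, `ramdisk_over_limit /var/nagiosramdisk 80 && echo "RAM disk is filling up"` could be run from cron. The 80 is an arbitrary example value.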
Have you done any manual updates to PNP on your system?
None that I am aware of. However, I do not have direct control over this VM, as it is hosted in our data center. If this is something I need to check on, I will have to ask those guys.
Can you post your nagios.cfg file?
Uploaded as an attachment.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by mguthrie »

My guess is that the CPU load caused the performance data to get backed up, which also caused the RAM disk to fill up. You might be OK with just the higher performance data threshold. Are you starting to get performance data on your graphs again?
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by jbennett »

Unfortunately, I am not. I just get the red X.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Tactical Overview, Ops Center, Ops Screen Problems

Post by mguthrie »

Hmm, that's likely something separate from the other issue. Do you have any errors or notices showing up in the Apache log?