High CPU

mtkaschools
Posts: 58
Joined: Tue Sep 14, 2010 7:53 am

I too can't seem to stop the localhost Current Load check from being flagged. We are running Nagios XI 2011R1.2 with about 300 hosts and 1250 services being monitored. We keep throwing more RAM and CPU at it, but it just sucks it all up and basically says 'thanks', then triggers the Current Load error again.

We currently have 4 processors and 8GB of RAM in a virtual machine feeding this XI environment. We currently have roughly 300 hosts and 1300 services.

Any thoughts?
rdedon
Posts: 578
Joined: Sat Nov 20, 2010 4:51 pm

Could you post the relevant info regarding your setup here?
From XI Staff To Our Customers,

In order to give you the best support possible, we ask that you submit your support requests with the following guidelines. These guidelines are intended to reduce resolution time to your issues.

For all support requests, we need to know:
  1. Linux Distribution and version?
  2. 32 or 64bit?
  3. VMware Image or Manual Install of XI?
  4. Are there special configurations on your system, i.e., is Gnome installed? Are you using a proxy? Are you using SSL?
  5. If you are encountering multiple issues that may not be related, start a thread for each issue.
For Installation Issues:
  • Above information
  • For Red Hat installs, you need to be registered with the RHN (Red Hat Network) in order to have full access to their repos. XI will not be able to install correctly without full repo access, since several critical packages depend on it.
  • Any error output noticed during installation, and what scripts were being run when errors were noticed.
  • Verify that both mysql and postgresql are installed and running after the "3-dbservers" script. Send us the output from the following commands:

Code:

service mysqld restart
service postgresql restart 
Thank you.
Rene deDon
Technical Team
___
Nagios Enterprises, LLC
Web: http://www.nagios.com
r.jaynes
Posts: 58
Joined: Wed May 19, 2010 1:27 pm

We have the same issue. Our host load will sometimes report critical, and the SMTP service for localhost will show a socket timeout. Prior to the 2011 install we didn't have this issue.

Relevant information:

NagiosXI 2011R1.1
Linux 2.6.18-194.11.3.el5 #1 SMP Mon Aug 30 16:23:24 EDT 2010 i686 i686 i386 GNU/Linux, installed from
32 Bit
VMware Image downloaded from the NagiosXI downloads
No special configs, gnome, no proxy, no SSL, basically a default install

38 hosts
180 services
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

mtkaschools,

Here's a Doc on understanding what affects XI performance:
http://library.nagios.com/library/produ ... erformance

It should probably be noted that MySQL is fairly opportunistic with RAM: it will cache as much memory as it can (up to about 90-95%) if that memory is not already in use. Not all of it is actually being used; think of it as being on standby as needed. CPU load is mostly affected by the factors mentioned in the Doc above.

r.jaynes,

Your issue may be different, since you don't have a very large number of checks running. Can you give us more detail on where you're seeing the SMTP timeout, what the CPU load is on a 15-minute average, and how much CPU power you have?
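
For reference, here is a minimal sketch of one way to gather those numbers on a Linux host. The helper name load_per_cpu is made up for illustration and is not part of Nagios XI; the sketch assumes /proc/loadavg and /proc/cpuinfo are available. A result that stays well above 1.00 suggests more runnable work than the CPUs can keep up with.

```shell
# Hypothetical helper: divide a load average by the number of CPUs.
# $1 = load average (e.g. 3.06), $2 = number of CPUs (e.g. 1)
load_per_cpu() {
    # Refuse a zero or negative CPU count instead of dividing by it.
    if [ "$2" -le 0 ] 2>/dev/null; then
        echo "cpu count must be a positive integer" >&2
        return 1
    fi
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Usage on the monitored host (not run automatically here):
#   load15=$(awk '{ print $3 }' /proc/loadavg)      # third field = 15-minute average
#   cpus=$(grep -c '^processor' /proc/cpuinfo)      # one "processor" line per CPU
#   load_per_cpu "$load15" "$cpus"
```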
r.jaynes
Posts: 58
Joined: Wed May 19, 2010 1:27 pm

The SMTP timeout shows up under the localhost services in NagiosXI, as a critical status. Currently the 15-minute load average is 3.06. The VM is running on VMware ESX 4.0, and has 2048 MHz of CPU and 512 MB of RAM assigned to it.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

I'll have to do some hunting to see if there were any relevant updates that might affect SMTP on your system. Your CPU load does seem high for that number of checks. I'd like to have you try a few things.

Let's make sure there are no corrupted tables in MySQL:
http://library.nagios.com/library/produ ... i-database

Let's keep the tables trimmed so they don't bog down the system:
http://library.nagios.com/library/produ ... timization

We made some important updates and bug fixes in 2011R1.2, so I definitely recommend upgrading to that when you're able.

Restart the server.

Let me know if you see any changes on your system.
r.jaynes
Posts: 58
Joined: Wed May 19, 2010 1:27 pm

Thank you for the help. This morning I've done the following:

1) Upgraded to 2011R1.2
2) Rebooted the VM
3) Repaired the tables per the PDF you linked (did not truncate yet)
4) Checked the values for the Performance->Database settings. They are all default and match the guide, except for the one in the guide that says "Repair Interval: 0". Our default value for this is "Optimize Interval: 60". Is that the same thing?
5) Noticed that VMware tools had not been configured. Ran the config utility, let it install the vmxnet driver, etc.
6) Upgraded the version of VMware tools to the latest version (successful upgrade, no issues)
7) Went back to the tables repair guide, truncated the two tables listed, and reran the repair script

Currently NagiosXI is sitting at 3.27 1-min, 3.38 5-min, 3.33 15-min.

While writing this post, I noticed the load average go high:

Code:

Host	Service	Status	Duration	Attempt	Last Check	Status Information
localhost	Current Load	Critical	10m 36s	4/4	2011-05-06 11:48:58	CRITICAL - load average: 10.09, 11.08, 7.94
            SMTP	Critical	5m 8s	5/5	2011-05-06 11:48:32	Connection refused
r.jaynes
Posts: 58
Joined: Wed May 19, 2010 1:27 pm

One more thing: I'm not sure whether this is relevant, but the user "postgres" currently has 26 running processes for the command "postmaster".
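
A count like this can be reproduced from the shell. The sketch below is only an illustration: count_procs is a made-up helper name, and it assumes ps prints one "user command" pair per line, as `ps -eo user,comm` does on Linux.

```shell
# Hypothetical helper: count processes owned by a user with a given command name.
# Reads "user command" pairs on stdin; $1 = user, $2 = command name.
count_procs() {
    awk -v u="$1" -v c="$2" '$1 == u && $2 == c { n++ } END { print n + 0 }'
}

# Usage on the monitored host (not run automatically here):
#   ps -eo user,comm | count_procs postgres postmaster
```

Watching this number over a few minutes shows whether the process count is stable or still growing.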
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

We had another user report something similar. I had him try the following, and it seemed to have positive results. This is pulled from the following thread (solution towards the end).

http://support.nagios.com/forum/viewtop ... =16&t=1494

Let's try running some queries against the postgresql database and see if anything stalls out. I suspect there is damage somewhere in the postgres database, but it's hard to say for sure. So far nobody else has reported this issue and we have never been able to replicate it, so it's hard to pinpoint exactly. Try running the queries below, and take note of any error messages, or of any query that takes more than 2 or 3 seconds.

Code:

psql nagiosxi nagiosxi
\d                                
select count(*) from xi_commands;
select count(*) from xi_events;
select count(*) from xi_meta;
select count(*) from xi_options;
select count(*) from xi_sysstat;
select count(*) from xi_usermeta;
select count(*) from xi_users;
The maintenance and cleaning commands are below; you can try running these as well. You'll get some warnings about not having permissions on some of the built-in postgres tables (those are normal), but post any error messages that might imply table damage or corruption.

Code:

vacuum;
vacuum analyze;
vacuum full;
r.jaynes
Posts: 58
Joined: Wed May 19, 2010 1:27 pm

I ran all of the commands manually, and nothing seemed to take too long, including the vacuum commands (I did receive the warnings as you noted). Next, I ran all of the commands through "time", for example:

Code:

time psql nagiosxi nagiosxi -c "select count(*) from xi_users;"
Here's the output of the select commands:

Code:

[root@monitor mail]# time psql nagiosxi nagiosxi -c "\d"
                     List of relations
 Schema |            Name             |   Type   |  Owner   
--------+-----------------------------+----------+----------
 public | if_command_id_seq           | sequence | nagiosxi
 public | if_meta_id_seq              | sequence | nagiosxi
 public | if_option_id_seq            | sequence | nagiosxi
 public | if_sysstat_id_seq           | sequence | nagiosxi
 public | if_user_id_seq              | sequence | nagiosxi
 public | if_usermeta_id_seq          | sequence | nagiosxi
 public | xi_commands                 | table    | nagiosxi
 public | xi_commands_command_id_seq  | sequence | nagiosxi
 public | xi_events                   | table    | nagiosxi
 public | xi_events_event_id_seq      | sequence | nagiosxi
 public | xi_meta                     | table    | nagiosxi
 public | xi_meta_meta_id_seq         | sequence | nagiosxi
 public | xi_options                  | table    | nagiosxi
 public | xi_options_option_id_seq    | sequence | nagiosxi
 public | xi_sysstat                  | table    | nagiosxi
 public | xi_sysstat_sysstat_id_seq   | sequence | nagiosxi
 public | xi_usermeta                 | table    | nagiosxi
 public | xi_usermeta_usermeta_id_seq | sequence | nagiosxi
 public | xi_users                    | table    | nagiosxi
 public | xi_users_user_id_seq        | sequence | nagiosxi
(20 rows)


real	0m0.036s
user	0m0.004s
sys	0m0.012s

[root@monitor mail]# time psql nagiosxi nagiosxi -c "select count(*) from xi_commands;"
 count 
-------
     1
(1 row)


real	0m0.284s
user	0m0.003s
sys	0m0.011s
[root@monitor mail]# time psql nagiosxi nagiosxi -c "select count(*) from xi_events;"
 count 
-------
   845
(1 row)


real	0m0.041s
user	0m0.006s
sys	0m0.011s
[root@monitor mail]# time psql nagiosxi nagiosxi -c "select count(*) from xi_meta;"
 count 
-------
   877
(1 row)


real	0m0.037s
user	0m0.004s
sys	0m0.014s
[root@monitor mail]# time psql nagiosxi nagiosxi -c "select count(*) from xi_options;"
 count 
-------
    37
(1 row)


real	0m0.029s
user	0m0.006s
sys	0m0.009s
[root@monitor mail]# time psql nagiosxi nagiosxi -c "select count(*) from xi_sysstat;"
 count 
-------
    16
(1 row)


real	0m0.029s
user	0m0.005s
sys	0m0.012s
[root@monitor mail]# time psql nagiosxi nagiosxi -c "select count(*) from xi_usermeta;"
 count 
-------
   250
(1 row)


real	0m0.113s
user	0m0.005s
sys	0m0.011s
[root@monitor mail]# time psql nagiosxi nagiosxi -c "select count(*) from xi_users;"
 count 
-------
     9
(1 row)


real	0m0.026s
user	0m0.003s
sys	0m0.010s
Here's the output of the vacuum time (second run):

Code:

[root@monitor mail]# time psql nagiosxi nagiosxi -c "vacuum;"
WARNING:  skipping "pg_authid" --- only table or database owner can vacuum it
WARNING:  skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING:  skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING:  skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING:  skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING:  skipping "pg_database" --- only table or database owner can vacuum it
VACUUM

real	0m0.783s
user	0m0.005s
sys	0m0.008s
[root@monitor mail]# time psql nagiosxi nagiosxi -c "vacuum analyze;"
WARNING:  skipping "pg_authid" --- only table or database owner can vacuum it
WARNING:  skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING:  skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING:  skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING:  skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING:  skipping "pg_database" --- only table or database owner can vacuum it
VACUUM

real	0m0.573s
user	0m0.006s
sys	0m0.011s
[root@monitor mail]# time psql nagiosxi nagiosxi -c "vacuum full;"
WARNING:  skipping "pg_authid" --- only table or database owner can vacuum it
WARNING:  skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING:  skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING:  skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING:  skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING:  skipping "pg_database" --- only table or database owner can vacuum it
VACUUM

real	0m0.206s
user	0m0.002s
sys	0m0.015s
Running the vacuum commands a second time was visibly faster than the first time. There are still 26 "postmaster" commands listed in top, and currently my load average is "load average: 4.83, 4.19, 3.83".
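
To check timings like these against the "more than 2 or 3 seconds" rule of thumb without reading each one by eye, something along these lines could work. It is a rough sketch: real_to_seconds and is_slow are made-up helper names, and the input is assumed to be in the "0m0.284s" form that bash's time keyword prints on the "real" line.

```shell
# Hypothetical helper: convert a bash `time` value such as 0m0.284s to seconds.
real_to_seconds() {
    # Splitting on "m" and "s" leaves minutes in $1 and seconds in $2.
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Hypothetical helper: succeed (exit 0) if the timing exceeds a threshold.
# $1 = "real" value from `time`, $2 = threshold in seconds
is_slow() {
    secs=$(real_to_seconds "$1")
    awk -v s="$secs" -v t="$2" 'BEGIN { exit !(s > t) }'
}

# Usage on the monitored host (not run automatically here):
#   real=$( { time psql nagiosxi nagiosxi -c "select count(*) from xi_users;" >/dev/null; } 2>&1 \
#           | awk '$1 == "real" { print $2 }' )
#   is_slow "$real" 2 && echo "slow query: $real"
```

By that measure, every timing shown above is far below the threshold.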