Page 1 of 1

Nagios Load + CPU extremely high but why?

Posted: Fri Jan 10, 2014 6:05 am
by fogier
We have a CentOS v6.3 64bit system (appliance on your website) with 1vCPU and 4GB RAM running on vSphere 5.0 Update 2 (+ some patches).
Since several weeks we've got problems with high server load and cpu usage. Nothing (that we can think of) has happened arround that period.
Now we've deployed a new Nagios server (same settings) and started to transfer services and hosts to the new server.

Now the old server is calming down (still to high) and the new server has the problem. It looks like we've moved the problem (read: cause) to the new server. How can we troubleshoot this problem.

When we reboot the server is pretty calm but over time load starts increasing. Nagios + NagVis doesn't work properly in the first place (also a problem on the old server). We need to "apply configuration" to get the thing working. When that proces finishes (3 green icons in website) it takes about 1 minute before nagvis is ready (blue icons saying that the backend is not there)
Nieuwe afbeelding (1).png
It takes ages to view 1 hosts with 5 services.
Nieuwe afbeelding (2).png
Hostcheck remains gray untill we apply the configuration after a reboot. Then it takes 1 minute to work.
Nieuwe afbeelding (3).png
Abnormal server load.

Re: Nagios Load + CPU extremely high but why?

Posted: Fri Jan 10, 2014 6:06 am
by fogier
screenshot of top.
Nieuwe afbeelding (4).png

Re: Nagios Load + CPU extremely high but why?

Posted: Fri Jan 10, 2014 6:31 am
by fogier
Update: when nobody is logged in to the nagios web interface, the server is calm. Problem seems to be in the webserver.

Update2: Run the following commands because or errors like this "PHP Notice: Undefined index: min_execution_time" in apache error log. In another thread on this forum this solves these errors.

service ndo2db stop
service nagios stop
service mysqld stop
cd /usr/local/nagiosxi/scripts
./repairmysql.sh nagios
shutdown -r now

Many indexes were fixed.

However, when I login to Nagios and simply refresh a page or query for one host throught the search box, the server peaks almost to 100% cpu usage from /usr/sbin/httpd proces.
Error log apache is empty and nothing new enters. So, something has changed.

Re: Nagios Load + CPU extremely high but why?

Posted: Fri Jan 10, 2014 10:36 am
by lmiltchev
Run the following commands and post the output in code wraps:

Code: Select all

sestatus
df -h
df -i
php -v
httpd -v
tail -50 /var/log/messages
tail -50 /var/log/mysqld.log
tail -50 /var/log/httpd/error_log
Have you changed anything on the system, such as upgrading Nagios Core or NagVis, that could potentially be causing the issue?

Re: Nagios Load + CPU extremely high but why?

Posted: Mon Jan 13, 2014 2:02 am
by fogier
Haven't changed anything. Is a new install. We think it has something to do with the status.dat (43MB) file (includes the comments). when we stop Nagios, the file is gone. After we start Nagios everything is fast until the status.dat file is build again. Then everything is slowing down and the cpu usage raises again.

Here the output of the commands you posted

sestatus

Code: Select all

SELinux status:                 disabled
df -h

Code: Select all

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                      7.5G  3.9G  3.2G  56% /
tmpfs                 1.9G     0  1.9G   0% /dev/shm
/dev/sda1             485M   50M  410M  11% /boot
df -i

Code: Select all

Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup-lv_root
                      494832   66878  427954   14% /
tmpfs                 490119       1  490118    1% /dev/shm
/dev/sda1             128016      44  127972    1% /boot
php -v

Code: Select all

PHP 5.3.3 (cli) (built: Jul 12 2013 20:35:47)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend Technologies
httpd -v

Code: Select all

Server version: Apache/2.2.15 (Unix)
Server built:   Aug 13 2013 17:29:28
tail -50 /var/log/messages

Code: Select all

Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18086;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18012;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16849;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14716;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14283;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14793;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18088;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16834;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16807;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14319;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14753;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18039;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16877;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18041;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18033;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14235;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16792;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16864;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16830;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18052;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:18:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;CRITICAL;SOFT;1;PROCS CRITICAL: 669 processes with STATE = RSZDT
Jan 13 07:18:49 localhost nagios: HOST ALERT: R_H_Oradbs02;DOWN;HARD;1;PING CRITICAL - Packet loss = 90%, RTA = 3.08 ms
Jan 13 07:18:49 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 90%, RTA = 3.08 ms
Jan 13 07:18:49 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 90%, RTA = 3.08 ms
Jan 13 07:19:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;OK;SOFT;2;PROCS OK: 95 processes with STATE = RSZDT
Jan 13 07:19:49 localhost nagios: SERVICE ALERT: D_H_SV14779;S_S_Authentication_Failure .nrpe;CRITICAL;SOFT;1;CRITICAL - (1 errors) - Jan 10 16:56:44 SV14779 sshd[6071]: error: PAM: Authentication failure for oracle from lt14488.dmdelft01.root.lan
Jan 13 07:19:59 localhost nagios: HOST ALERT: R_H_Oradbs02;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 3.54 ms
Jan 13 07:19:59 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 0%, RTA = 3.54 ms
Jan 13 07:19:59 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 0%, RTA = 3.54 ms
Jan 13 07:20:49 localhost nagios: SERVICE ALERT: D_H_SV14779;S_S_Authentication_Failure .nrpe;OK;SOFT;2;OK - no errors or warnings
Jan 13 07:39:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;CRITICAL;SOFT;1;PROCS CRITICAL: 925 processes with STATE = RSZDT
Jan 13 07:40:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;OK;SOFT;2;PROCS OK: 96 processes with STATE = RSZDT
Jan 13 07:42:59 localhost nagios: SERVICE FLAPPING ALERT: D_H_SV18016;Intern_g7 Cluster;STOPPED; Service appears to have stopped flapping (3.8% change < 5.0% threshold)
Jan 13 07:51:29 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;SOFT;1;CRITICAL - load average: 13.43, 7.19, 3.55
Jan 13 07:51:39 localhost nagios: HOST ALERT: R_H_Oradbs02;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
Jan 13 07:51:39 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 100%
Jan 13 07:51:40 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 100%
Jan 13 07:52:29 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;SOFT;2;CRITICAL - load average: 15.60, 8.90, 4.36
Jan 13 07:52:49 localhost nagios: HOST ALERT: R_H_Oradbs02;UP;HARD;1;PING OK - Packet loss = 50%, RTA = 3.89 ms
Jan 13 07:52:49 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 50%, RTA = 3.89 ms
Jan 13 07:52:51 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 50%, RTA = 3.89 ms
Jan 13 07:53:02 localhost nagios: HOST ALERT: R_H_Allegro-prod;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
Jan 13 07:53:03 localhost nagios: HOST NOTIFICATION: fogier;R_H_Allegro-prod;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 100%
Jan 13 07:53:41 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;SOFT;3;CRITICAL - load average: 18.75, 11.18, 5.46
Jan 13 07:54:21 localhost nagios: HOST ALERT: R_H_Allegro-prod;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 6.47 ms
Jan 13 07:54:21 localhost nagios: HOST NOTIFICATION: fogier;R_H_Allegro-prod;UP;xi_host_notification_handler;PING OK - Packet loss = 0%, RTA = 6.47 ms
Jan 13 07:54:40 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;HARD;4;CRITICAL - load average: 21.82, 13.62, 6.66
Jan 13 07:54:40 localhost nagios: SERVICE NOTIFICATION: tverstegen;S_H_SV14236;S_S_Current_Load;CRITICAL;xi_service_notification_handler;CRITICAL - load average: 21.82, 13.62, 6.66
Jan 13 07:54:42 localhost nagios: SERVICE NOTIFICATION: nagiosadmin;S_H_SV14236;S_S_Current_Load;CRITICAL;xi_service_notification_handler;CRITICAL - load average: 21.82, 13.62, 6.66
Jan 13 07:54:43 localhost nagios: SERVICE NOTIFICATION: fogier;S_H_SV14236;S_S_Current_Load;CRITICAL;xi_service_notification_handler;CRITICAL - load average: 21.82, 13.62, 6.66
tail -50 /var/log/mysqld.log

Code: Select all

140110 11:22:49  InnoDB: Completed initialization of buffer pool
140110 11:22:49  InnoDB: Started; log sequence number 0 44233
140110 11:22:49 [Note] Event Scheduler: Loaded 0 events
140110 11:22:49 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140110 12:52:53 [Note] /usr/libexec/mysqld: Normal shutdown

140110 12:52:53 [Note] Event Scheduler: Purging the queue. 0 events
140110 12:52:53  InnoDB: Starting shutdown...
140110 12:52:54  InnoDB: Shutdown completed; log sequence number 0 44233
140110 12:52:54 [Note] /usr/libexec/mysqld: Shutdown complete

140110 12:52:54 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140110 12:53:33 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140110 12:53:33  InnoDB: Initializing buffer pool, size = 8.0M
140110 12:53:33  InnoDB: Completed initialization of buffer pool
140110 12:53:33  InnoDB: Started; log sequence number 0 44233
140110 12:53:33 [Note] Event Scheduler: Loaded 0 events
140110 12:53:33 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140110 12:55:19 [Note] /usr/libexec/mysqld: Normal shutdown

140110 12:55:19 [Note] Event Scheduler: Purging the queue. 0 events
140110 12:55:19  InnoDB: Starting shutdown...
140110 12:55:24  InnoDB: Shutdown completed; log sequence number 0 44233
140110 12:55:24 [Note] /usr/libexec/mysqld: Shutdown complete

140110 12:55:24 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140110 12:56:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140110 12:56:05  InnoDB: Initializing buffer pool, size = 8.0M
140110 12:56:05  InnoDB: Completed initialization of buffer pool
140110 12:56:05  InnoDB: Started; log sequence number 0 44233
140110 12:56:06 [Note] Event Scheduler: Loaded 0 events
140110 12:56:06 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140113  5:50:04 [Note] /usr/libexec/mysqld: Normal shutdown

140113  5:50:04 [Note] Event Scheduler: Purging the queue. 0 events
140113  5:50:07  InnoDB: Starting shutdown...
140113  5:50:10  InnoDB: Shutdown completed; log sequence number 0 44233
140113  5:50:10 [Note] /usr/libexec/mysqld: Shutdown complete

140113 05:50:10 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140113 05:50:42 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140113  5:50:42  InnoDB: Initializing buffer pool, size = 8.0M
140113  5:50:42  InnoDB: Completed initialization of buffer pool
140113  5:50:42  InnoDB: Started; log sequence number 0 44233
140113  5:50:42 [Note] Event Scheduler: Loaded 0 events
140113  5:50:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
tail -50 /var/log/httpd/error_log

Code: Select all

[Sun Jan 12 03:50:05 2014] [notice] Digest: generating secret for digest authentication ...
[Sun Jan 12 03:50:05 2014] [notice] Digest: done
[Sun Jan 12 03:50:05 2014] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 mod_ssl/2.2.15 OpenSSL/1.0.0-fips configured -- resuming normal operations
[Mon Jan 13 05:50:03 2014] [notice] caught SIGTERM, shutting down
[Mon Jan 13 05:50:46 2014] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Mon Jan 13 05:50:46 2014] [notice] Digest: generating secret for digest authentication ...
[Mon Jan 13 05:50:46 2014] [notice] Digest: done
[Mon Jan 13 05:50:48 2014] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 mod_ssl/2.2.15 OpenSSL/1.0.0-fips configured -- resuming normal operations

Re: Nagios Load + CPU extremely high but why?

Posted: Mon Jan 13, 2014 9:41 am
by slansing
Are these users running a lot of reports, or querying the database quite often when they are logged in? (This can attribute to higher load.) Or are they just watching a dashboard or a view.

Re: Nagios Load + CPU extremely high but why?

Posted: Wed Jan 15, 2014 1:35 am
by fogier
Right now only 2 people use Nagios. When we browse through the website, CPU raises. So no reports or querying the database (except the standard refresh from Nagios itself). We've removed a recurring downtime setting for a servicegroup so that our status.dat shrinkked (don't know if that's a word ;) ) from 36MB to 11MB. No visible improvement.

Re: Nagios Load + CPU extremely high but why?

Posted: Wed Jan 15, 2014 4:49 am
by fogier
It looks like we've found the cause of the issue. We've removed all the scheduled downtime entries. Now we have a status.dat of 1,8MB (instead of 36MB). The website responds fast and without issues.

Re: Nagios Load + CPU extremely high but why?

Posted: Wed Jan 15, 2014 9:49 am
by tmcdonald
Sweet. Gonna close this up now, feel free to open another thread if you have more questions.