Nagios Load + CPU extremely high but why?
Nagios Load + CPU extremely high but why?
We have a CentOS v6.3 64bit system (appliance on your website) with 1vCPU and 4GB RAM running on vSphere 5.0 Update 2 (+ some patches).
Since several weeks we've got problems with high server load and cpu usage. Nothing (that we can think of) has happened arround that period.
Now we've deployed a new Nagios server (same settings) and started to transfer services and hosts to the new server.
Now the old server is calming down (still to high) and the new server has the problem. It looks like we've moved the problem (read: cause) to the new server. How can we troubleshoot this problem.
When we reboot the server is pretty calm but over time load starts increasing. Nagios + NagVis doesn't work properly in the first place (also a problem on the old server). We need to "apply configuration" to get the thing working. When that proces finishes (3 green icons in website) it takes about 1 minute before nagvis is ready (blue icons saying that the backend is not there)
It takes ages to view 1 hosts with 5 services.
Hostcheck remains gray untill we apply the configuration after a reboot. Then it takes 1 minute to work.
Abnormal server load.
Since several weeks we've got problems with high server load and cpu usage. Nothing (that we can think of) has happened arround that period.
Now we've deployed a new Nagios server (same settings) and started to transfer services and hosts to the new server.
Now the old server is calming down (still to high) and the new server has the problem. It looks like we've moved the problem (read: cause) to the new server. How can we troubleshoot this problem.
When we reboot the server is pretty calm but over time load starts increasing. Nagios + NagVis doesn't work properly in the first place (also a problem on the old server). We need to "apply configuration" to get the thing working. When that proces finishes (3 green icons in website) it takes about 1 minute before nagvis is ready (blue icons saying that the backend is not there)
It takes ages to view 1 hosts with 5 services.
Hostcheck remains gray untill we apply the configuration after a reboot. Then it takes 1 minute to work.
Abnormal server load.
You do not have the required permissions to view the files attached to this post.
Re: Nagios Load + CPU extremely high but why?
screenshot of top.
You do not have the required permissions to view the files attached to this post.
Re: Nagios Load + CPU extremely high but why?
Update: when nobody is logged in to the nagios web interface, the server is calm. Problem seems to be in the webserver.
Update2: Run the following commands because or errors like this "PHP Notice: Undefined index: min_execution_time" in apache error log. In another thread on this forum this solves these errors.
service ndo2db stop
service nagios stop
service mysqld stop
cd /usr/local/nagiosxi/scripts
./repairmysql.sh nagios
shutdown -r now
Many indexes were fixed.
However, when I login to Nagios and simply refresh a page or query for one host throught the search box, the server peaks almost to 100% cpu usage from /usr/sbin/httpd proces.
Error log apache is empty and nothing new enters. So, something has changed.
Update2: Run the following commands because or errors like this "PHP Notice: Undefined index: min_execution_time" in apache error log. In another thread on this forum this solves these errors.
service ndo2db stop
service nagios stop
service mysqld stop
cd /usr/local/nagiosxi/scripts
./repairmysql.sh nagios
shutdown -r now
Many indexes were fixed.
However, when I login to Nagios and simply refresh a page or query for one host throught the search box, the server peaks almost to 100% cpu usage from /usr/sbin/httpd proces.
Error log apache is empty and nothing new enters. So, something has changed.
Re: Nagios Load + CPU extremely high but why?
Run the following commands and post the output in code wraps:
Have you changed anything on the system, such as upgrading Nagios Core or NagVis, that could potentially be causing the issue?
Code: Select all
sestatus
df -h
df -i
php -v
httpd -v
tail -50 /var/log/messages
tail -50 /var/log/mysqld.log
tail -50 /var/log/httpd/error_logBe sure to check out our Knowledgebase for helpful articles and solutions!
Re: Nagios Load + CPU extremely high but why?
Haven't changed anything. Is a new install. We think it has something to do with the status.dat (43MB) file (includes the comments). when we stop Nagios, the file is gone. After we start Nagios everything is fast until the status.dat file is build again. Then everything is slowing down and the cpu usage raises again.
Here the output of the commands you posted
sestatus
df -h
df -i
php -v
httpd -v
tail -50 /var/log/messages
tail -50 /var/log/mysqld.log
tail -50 /var/log/httpd/error_log
Here the output of the commands you posted
sestatus
Code: Select all
SELinux status: disabledCode: Select all
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
7.5G 3.9G 3.2G 56% /
tmpfs 1.9G 0 1.9G 0% /dev/shm
/dev/sda1 485M 50M 410M 11% /bootCode: Select all
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/VolGroup-lv_root
494832 66878 427954 14% /
tmpfs 490119 1 490118 1% /dev/shm
/dev/sda1 128016 44 127972 1% /bootCode: Select all
PHP 5.3.3 (cli) (built: Jul 12 2013 20:35:47)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend TechnologiesCode: Select all
Server version: Apache/2.2.15 (Unix)
Server built: Aug 13 2013 17:29:28Code: Select all
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18086;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18012;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16849;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14716;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14283;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14793;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18088;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16834;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16807;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14319;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14753;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18039;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16877;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18041;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18033;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14235;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16792;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16864;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16830;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18052;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:18:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;CRITICAL;SOFT;1;PROCS CRITICAL: 669 processes with STATE = RSZDT
Jan 13 07:18:49 localhost nagios: HOST ALERT: R_H_Oradbs02;DOWN;HARD;1;PING CRITICAL - Packet loss = 90%, RTA = 3.08 ms
Jan 13 07:18:49 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 90%, RTA = 3.08 ms
Jan 13 07:18:49 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 90%, RTA = 3.08 ms
Jan 13 07:19:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;OK;SOFT;2;PROCS OK: 95 processes with STATE = RSZDT
Jan 13 07:19:49 localhost nagios: SERVICE ALERT: D_H_SV14779;S_S_Authentication_Failure .nrpe;CRITICAL;SOFT;1;CRITICAL - (1 errors) - Jan 10 16:56:44 SV14779 sshd[6071]: error: PAM: Authentication failure for oracle from lt14488.dmdelft01.root.lan
Jan 13 07:19:59 localhost nagios: HOST ALERT: R_H_Oradbs02;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 3.54 ms
Jan 13 07:19:59 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 0%, RTA = 3.54 ms
Jan 13 07:19:59 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 0%, RTA = 3.54 ms
Jan 13 07:20:49 localhost nagios: SERVICE ALERT: D_H_SV14779;S_S_Authentication_Failure .nrpe;OK;SOFT;2;OK - no errors or warnings
Jan 13 07:39:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;CRITICAL;SOFT;1;PROCS CRITICAL: 925 processes with STATE = RSZDT
Jan 13 07:40:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;OK;SOFT;2;PROCS OK: 96 processes with STATE = RSZDT
Jan 13 07:42:59 localhost nagios: SERVICE FLAPPING ALERT: D_H_SV18016;Intern_g7 Cluster;STOPPED; Service appears to have stopped flapping (3.8% change < 5.0% threshold)
Jan 13 07:51:29 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;SOFT;1;CRITICAL - load average: 13.43, 7.19, 3.55
Jan 13 07:51:39 localhost nagios: HOST ALERT: R_H_Oradbs02;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
Jan 13 07:51:39 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 100%
Jan 13 07:51:40 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 100%
Jan 13 07:52:29 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;SOFT;2;CRITICAL - load average: 15.60, 8.90, 4.36
Jan 13 07:52:49 localhost nagios: HOST ALERT: R_H_Oradbs02;UP;HARD;1;PING OK - Packet loss = 50%, RTA = 3.89 ms
Jan 13 07:52:49 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 50%, RTA = 3.89 ms
Jan 13 07:52:51 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 50%, RTA = 3.89 ms
Jan 13 07:53:02 localhost nagios: HOST ALERT: R_H_Allegro-prod;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
Jan 13 07:53:03 localhost nagios: HOST NOTIFICATION: fogier;R_H_Allegro-prod;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 100%
Jan 13 07:53:41 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;SOFT;3;CRITICAL - load average: 18.75, 11.18, 5.46
Jan 13 07:54:21 localhost nagios: HOST ALERT: R_H_Allegro-prod;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 6.47 ms
Jan 13 07:54:21 localhost nagios: HOST NOTIFICATION: fogier;R_H_Allegro-prod;UP;xi_host_notification_handler;PING OK - Packet loss = 0%, RTA = 6.47 ms
Jan 13 07:54:40 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;HARD;4;CRITICAL - load average: 21.82, 13.62, 6.66
Jan 13 07:54:40 localhost nagios: SERVICE NOTIFICATION: tverstegen;S_H_SV14236;S_S_Current_Load;CRITICAL;xi_service_notification_handler;CRITICAL - load average: 21.82, 13.62, 6.66
Jan 13 07:54:42 localhost nagios: SERVICE NOTIFICATION: nagiosadmin;S_H_SV14236;S_S_Current_Load;CRITICAL;xi_service_notification_handler;CRITICAL - load average: 21.82, 13.62, 6.66
Jan 13 07:54:43 localhost nagios: SERVICE NOTIFICATION: fogier;S_H_SV14236;S_S_Current_Load;CRITICAL;xi_service_notification_handler;CRITICAL - load average: 21.82, 13.62, 6.66Code: Select all
140110 11:22:49 InnoDB: Completed initialization of buffer pool
140110 11:22:49 InnoDB: Started; log sequence number 0 44233
140110 11:22:49 [Note] Event Scheduler: Loaded 0 events
140110 11:22:49 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution
140110 12:52:53 [Note] /usr/libexec/mysqld: Normal shutdown
140110 12:52:53 [Note] Event Scheduler: Purging the queue. 0 events
140110 12:52:53 InnoDB: Starting shutdown...
140110 12:52:54 InnoDB: Shutdown completed; log sequence number 0 44233
140110 12:52:54 [Note] /usr/libexec/mysqld: Shutdown complete
140110 12:52:54 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140110 12:53:33 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140110 12:53:33 InnoDB: Initializing buffer pool, size = 8.0M
140110 12:53:33 InnoDB: Completed initialization of buffer pool
140110 12:53:33 InnoDB: Started; log sequence number 0 44233
140110 12:53:33 [Note] Event Scheduler: Loaded 0 events
140110 12:53:33 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution
140110 12:55:19 [Note] /usr/libexec/mysqld: Normal shutdown
140110 12:55:19 [Note] Event Scheduler: Purging the queue. 0 events
140110 12:55:19 InnoDB: Starting shutdown...
140110 12:55:24 InnoDB: Shutdown completed; log sequence number 0 44233
140110 12:55:24 [Note] /usr/libexec/mysqld: Shutdown complete
140110 12:55:24 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140110 12:56:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140110 12:56:05 InnoDB: Initializing buffer pool, size = 8.0M
140110 12:56:05 InnoDB: Completed initialization of buffer pool
140110 12:56:05 InnoDB: Started; log sequence number 0 44233
140110 12:56:06 [Note] Event Scheduler: Loaded 0 events
140110 12:56:06 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distribution
140113 5:50:04 [Note] /usr/libexec/mysqld: Normal shutdown
140113 5:50:04 [Note] Event Scheduler: Purging the queue. 0 events
140113 5:50:07 InnoDB: Starting shutdown...
140113 5:50:10 InnoDB: Shutdown completed; log sequence number 0 44233
140113 5:50:10 [Note] /usr/libexec/mysqld: Shutdown complete
140113 05:50:10 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140113 05:50:42 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140113 5:50:42 InnoDB: Initializing buffer pool, size = 8.0M
140113 5:50:42 InnoDB: Completed initialization of buffer pool
140113 5:50:42 InnoDB: Started; log sequence number 0 44233
140113 5:50:42 [Note] Event Scheduler: Loaded 0 events
140113 5:50:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69' socket: '/var/lib/mysql/mysql.sock' port: 3306 Source distributionCode: Select all
[Sun Jan 12 03:50:05 2014] [notice] Digest: generating secret for digest authentication ...
[Sun Jan 12 03:50:05 2014] [notice] Digest: done
[Sun Jan 12 03:50:05 2014] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 mod_ssl/2.2.15 OpenSSL/1.0.0-fips configured -- resuming normal operations
[Mon Jan 13 05:50:03 2014] [notice] caught SIGTERM, shutting down
[Mon Jan 13 05:50:46 2014] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Mon Jan 13 05:50:46 2014] [notice] Digest: generating secret for digest authentication ...
[Mon Jan 13 05:50:46 2014] [notice] Digest: done
[Mon Jan 13 05:50:48 2014] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 mod_ssl/2.2.15 OpenSSL/1.0.0-fips configured -- resuming normal operations-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Nagios Load + CPU extremely high but why?
Are these users running a lot of reports, or querying the database quite often when they are logged in? (This can attribute to higher load.) Or are they just watching a dashboard or a view.
Re: Nagios Load + CPU extremely high but why?
Right now only 2 people use Nagios. When we browse through the website, CPU raises. So no reports or querying the database (except the standard refresh from Nagios itself). We've removed a recurring downtime setting for a servicegroup so that our status.dat shrinkked (don't know if that's a word
) from 36MB to 11MB. No visible improvement.
Re: Nagios Load + CPU extremely high but why?
It looks like we've found the cause of the issue. We've removed all the scheduled downtime entries. Now we have a status.dat of 1,8MB (instead of 36MB). The website responds fast and without issues.
Re: Nagios Load + CPU extremely high but why?
Sweet. Gonna close this up now, feel free to open another thread if you have more questions.
Former Nagios employee