Nagios Load + CPU extremely high but why?

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
fogier
Posts: 66
Joined: Fri Jul 13, 2012 12:53 am

Nagios Load + CPU extremely high but why?

Post by fogier »

We have a CentOS v6.3 64bit system (appliance on your website) with 1vCPU and 4GB RAM running on vSphere 5.0 Update 2 (+ some patches).
Since several weeks we've got problems with high server load and cpu usage. Nothing (that we can think of) has happened arround that period.
Now we've deployed a new Nagios server (same settings) and started to transfer services and hosts to the new server.

Now the old server is calming down (still to high) and the new server has the problem. It looks like we've moved the problem (read: cause) to the new server. How can we troubleshoot this problem.

When we reboot the server is pretty calm but over time load starts increasing. Nagios + NagVis doesn't work properly in the first place (also a problem on the old server). We need to "apply configuration" to get the thing working. When that proces finishes (3 green icons in website) it takes about 1 minute before nagvis is ready (blue icons saying that the backend is not there)
Nieuwe afbeelding (1).png
It takes ages to view 1 hosts with 5 services.
Nieuwe afbeelding (2).png
Hostcheck remains gray untill we apply the configuration after a reboot. Then it takes 1 minute to work.
Nieuwe afbeelding (3).png
Abnormal server load.
You do not have the required permissions to view the files attached to this post.
fogier
Posts: 66
Joined: Fri Jul 13, 2012 12:53 am

Re: Nagios Load + CPU extremely high but why?

Post by fogier »

screenshot of top.
Nieuwe afbeelding (4).png
You do not have the required permissions to view the files attached to this post.
fogier
Posts: 66
Joined: Fri Jul 13, 2012 12:53 am

Re: Nagios Load + CPU extremely high but why?

Post by fogier »

Update: when nobody is logged in to the nagios web interface, the server is calm. Problem seems to be in the webserver.

Update2: Run the following commands because or errors like this "PHP Notice: Undefined index: min_execution_time" in apache error log. In another thread on this forum this solves these errors.

service ndo2db stop
service nagios stop
service mysqld stop
cd /usr/local/nagiosxi/scripts
./repairmysql.sh nagios
shutdown -r now

Many indexes were fixed.

However, when I login to Nagios and simply refresh a page or query for one host throught the search box, the server peaks almost to 100% cpu usage from /usr/sbin/httpd proces.
Error log apache is empty and nothing new enters. So, something has changed.
User avatar
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Nagios Load + CPU extremely high but why?

Post by lmiltchev »

Run the following commands and post the output in code wraps:

Code: Select all

sestatus
df -h
df -i
php -v
httpd -v
tail -50 /var/log/messages
tail -50 /var/log/mysqld.log
tail -50 /var/log/httpd/error_log
Have you changed anything on the system, such as upgrading Nagios Core or NagVis, that could potentially be causing the issue?
Be sure to check out our Knowledgebase for helpful articles and solutions!
fogier
Posts: 66
Joined: Fri Jul 13, 2012 12:53 am

Re: Nagios Load + CPU extremely high but why?

Post by fogier »

Haven't changed anything. Is a new install. We think it has something to do with the status.dat (43MB) file (includes the comments). when we stop Nagios, the file is gone. After we start Nagios everything is fast until the status.dat file is build again. Then everything is slowing down and the cpu usage raises again.

Here the output of the commands you posted

sestatus

Code: Select all

SELinux status:                 disabled
df -h

Code: Select all

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                      7.5G  3.9G  3.2G  56% /
tmpfs                 1.9G     0  1.9G   0% /dev/shm
/dev/sda1             485M   50M  410M  11% /boot
df -i

Code: Select all

Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup-lv_root
                      494832   66878  427954   14% /
tmpfs                 490119       1  490118    1% /dev/shm
/dev/sda1             128016      44  127972    1% /boot
php -v

Code: Select all

PHP 5.3.3 (cli) (built: Jul 12 2013 20:35:47)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend Technologies
httpd -v

Code: Select all

Server version: Apache/2.2.15 (Unix)
Server built:   Aug 13 2013 17:29:28
tail -50 /var/log/messages

Code: Select all

Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18086;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18012;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16849;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14716;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14283;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14793;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18088;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16834;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16807;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14319;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14753;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18039;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16877;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18041;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18033;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV14235;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16792;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16864;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV16830;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:15:05 localhost nagios: SERVICE DOWNTIME ALERT: D_H_SV18052;D_S_VSS_Service .nrpe;STARTED; Service has entered a period of scheduled downtime
Jan 13 07:18:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;CRITICAL;SOFT;1;PROCS CRITICAL: 669 processes with STATE = RSZDT
Jan 13 07:18:49 localhost nagios: HOST ALERT: R_H_Oradbs02;DOWN;HARD;1;PING CRITICAL - Packet loss = 90%, RTA = 3.08 ms
Jan 13 07:18:49 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 90%, RTA = 3.08 ms
Jan 13 07:18:49 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 90%, RTA = 3.08 ms
Jan 13 07:19:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;OK;SOFT;2;PROCS OK: 95 processes with STATE = RSZDT
Jan 13 07:19:49 localhost nagios: SERVICE ALERT: D_H_SV14779;S_S_Authentication_Failure .nrpe;CRITICAL;SOFT;1;CRITICAL - (1 errors) - Jan 10 16:56:44 SV14779 sshd[6071]: error: PAM: Authentication failure for oracle from lt14488.dmdelft01.root.lan
Jan 13 07:19:59 localhost nagios: HOST ALERT: R_H_Oradbs02;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 3.54 ms
Jan 13 07:19:59 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 0%, RTA = 3.54 ms
Jan 13 07:19:59 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 0%, RTA = 3.54 ms
Jan 13 07:20:49 localhost nagios: SERVICE ALERT: D_H_SV14779;S_S_Authentication_Failure .nrpe;OK;SOFT;2;OK - no errors or warnings
Jan 13 07:39:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;CRITICAL;SOFT;1;PROCS CRITICAL: 925 processes with STATE = RSZDT
Jan 13 07:40:39 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Total_Processes;OK;SOFT;2;PROCS OK: 96 processes with STATE = RSZDT
Jan 13 07:42:59 localhost nagios: SERVICE FLAPPING ALERT: D_H_SV18016;Intern_g7 Cluster;STOPPED; Service appears to have stopped flapping (3.8% change < 5.0% threshold)
Jan 13 07:51:29 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;SOFT;1;CRITICAL - load average: 13.43, 7.19, 3.55
Jan 13 07:51:39 localhost nagios: HOST ALERT: R_H_Oradbs02;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
Jan 13 07:51:39 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 100%
Jan 13 07:51:40 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 100%
Jan 13 07:52:29 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;SOFT;2;CRITICAL - load average: 15.60, 8.90, 4.36
Jan 13 07:52:49 localhost nagios: HOST ALERT: R_H_Oradbs02;UP;HARD;1;PING OK - Packet loss = 50%, RTA = 3.89 ms
Jan 13 07:52:49 localhost nagios: HOST NOTIFICATION: fvloon;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 50%, RTA = 3.89 ms
Jan 13 07:52:51 localhost nagios: HOST NOTIFICATION: fogier;R_H_Oradbs02;UP;xi_host_notification_handler;PING OK - Packet loss = 50%, RTA = 3.89 ms
Jan 13 07:53:02 localhost nagios: HOST ALERT: R_H_Allegro-prod;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
Jan 13 07:53:03 localhost nagios: HOST NOTIFICATION: fogier;R_H_Allegro-prod;DOWN;xi_host_notification_handler;PING CRITICAL - Packet loss = 100%
Jan 13 07:53:41 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;SOFT;3;CRITICAL - load average: 18.75, 11.18, 5.46
Jan 13 07:54:21 localhost nagios: HOST ALERT: R_H_Allegro-prod;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 6.47 ms
Jan 13 07:54:21 localhost nagios: HOST NOTIFICATION: fogier;R_H_Allegro-prod;UP;xi_host_notification_handler;PING OK - Packet loss = 0%, RTA = 6.47 ms
Jan 13 07:54:40 localhost nagios: SERVICE ALERT: S_H_SV14236;S_S_Current_Load;CRITICAL;HARD;4;CRITICAL - load average: 21.82, 13.62, 6.66
Jan 13 07:54:40 localhost nagios: SERVICE NOTIFICATION: tverstegen;S_H_SV14236;S_S_Current_Load;CRITICAL;xi_service_notification_handler;CRITICAL - load average: 21.82, 13.62, 6.66
Jan 13 07:54:42 localhost nagios: SERVICE NOTIFICATION: nagiosadmin;S_H_SV14236;S_S_Current_Load;CRITICAL;xi_service_notification_handler;CRITICAL - load average: 21.82, 13.62, 6.66
Jan 13 07:54:43 localhost nagios: SERVICE NOTIFICATION: fogier;S_H_SV14236;S_S_Current_Load;CRITICAL;xi_service_notification_handler;CRITICAL - load average: 21.82, 13.62, 6.66
tail -50 /var/log/mysqld.log

Code: Select all

140110 11:22:49  InnoDB: Completed initialization of buffer pool
140110 11:22:49  InnoDB: Started; log sequence number 0 44233
140110 11:22:49 [Note] Event Scheduler: Loaded 0 events
140110 11:22:49 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140110 12:52:53 [Note] /usr/libexec/mysqld: Normal shutdown

140110 12:52:53 [Note] Event Scheduler: Purging the queue. 0 events
140110 12:52:53  InnoDB: Starting shutdown...
140110 12:52:54  InnoDB: Shutdown completed; log sequence number 0 44233
140110 12:52:54 [Note] /usr/libexec/mysqld: Shutdown complete

140110 12:52:54 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140110 12:53:33 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140110 12:53:33  InnoDB: Initializing buffer pool, size = 8.0M
140110 12:53:33  InnoDB: Completed initialization of buffer pool
140110 12:53:33  InnoDB: Started; log sequence number 0 44233
140110 12:53:33 [Note] Event Scheduler: Loaded 0 events
140110 12:53:33 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140110 12:55:19 [Note] /usr/libexec/mysqld: Normal shutdown

140110 12:55:19 [Note] Event Scheduler: Purging the queue. 0 events
140110 12:55:19  InnoDB: Starting shutdown...
140110 12:55:24  InnoDB: Shutdown completed; log sequence number 0 44233
140110 12:55:24 [Note] /usr/libexec/mysqld: Shutdown complete

140110 12:55:24 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140110 12:56:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140110 12:56:05  InnoDB: Initializing buffer pool, size = 8.0M
140110 12:56:05  InnoDB: Completed initialization of buffer pool
140110 12:56:05  InnoDB: Started; log sequence number 0 44233
140110 12:56:06 [Note] Event Scheduler: Loaded 0 events
140110 12:56:06 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140113  5:50:04 [Note] /usr/libexec/mysqld: Normal shutdown

140113  5:50:04 [Note] Event Scheduler: Purging the queue. 0 events
140113  5:50:07  InnoDB: Starting shutdown...
140113  5:50:10  InnoDB: Shutdown completed; log sequence number 0 44233
140113  5:50:10 [Note] /usr/libexec/mysqld: Shutdown complete

140113 05:50:10 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140113 05:50:42 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140113  5:50:42  InnoDB: Initializing buffer pool, size = 8.0M
140113  5:50:42  InnoDB: Completed initialization of buffer pool
140113  5:50:42  InnoDB: Started; log sequence number 0 44233
140113  5:50:42 [Note] Event Scheduler: Loaded 0 events
140113  5:50:42 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.69'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
tail -50 /var/log/httpd/error_log

Code: Select all

[Sun Jan 12 03:50:05 2014] [notice] Digest: generating secret for digest authentication ...
[Sun Jan 12 03:50:05 2014] [notice] Digest: done
[Sun Jan 12 03:50:05 2014] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 mod_ssl/2.2.15 OpenSSL/1.0.0-fips configured -- resuming normal operations
[Mon Jan 13 05:50:03 2014] [notice] caught SIGTERM, shutting down
[Mon Jan 13 05:50:46 2014] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Mon Jan 13 05:50:46 2014] [notice] Digest: generating secret for digest authentication ...
[Mon Jan 13 05:50:46 2014] [notice] Digest: done
[Mon Jan 13 05:50:48 2014] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 mod_ssl/2.2.15 OpenSSL/1.0.0-fips configured -- resuming normal operations
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios Load + CPU extremely high but why?

Post by slansing »

Are these users running a lot of reports, or querying the database quite often when they are logged in? (This can attribute to higher load.) Or are they just watching a dashboard or a view.
fogier
Posts: 66
Joined: Fri Jul 13, 2012 12:53 am

Re: Nagios Load + CPU extremely high but why?

Post by fogier »

Right now only 2 people use Nagios. When we browse through the website, CPU raises. So no reports or querying the database (except the standard refresh from Nagios itself). We've removed a recurring downtime setting for a servicegroup so that our status.dat shrinkked (don't know if that's a word ;) ) from 36MB to 11MB. No visible improvement.
fogier
Posts: 66
Joined: Fri Jul 13, 2012 12:53 am

Re: Nagios Load + CPU extremely high but why?

Post by fogier »

It looks like we've found the cause of the issue. We've removed all the scheduled downtime entries. Now we have a status.dat of 1,8MB (instead of 36MB). The website responds fast and without issues.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios Load + CPU extremely high but why?

Post by tmcdonald »

Sweet. Gonna close this up now, feel free to open another thread if you have more questions.
Former Nagios employee
Locked