
Nagios performance trouble

Posted: Tue Sep 18, 2012 1:21 pm
by hhlodge
Hi,

Over the past couple of weeks I've been getting more and more CPU load notifications for my XI server. It's a physical server running CentOS 5.6 64-bit with XI 2011R3.2. Now the web interface is starting to perform badly, and I have PHP commands timing out in the logs. Load averages are often in the 8-10 range now, and vmstat shows many blocked processes. I've got 8GB of RAM with 1.5GB unused, so it's not swapping. Something is suddenly doing a lot of IO. In top, httpd and mysqld are usually the top CPU users. Based on some reading, this should show which processes are blocked on IO:

Code: Select all

#  ps aux | grep " [D]"
root      2349  0.0  0.0      0     0 ?        D<   Sep17   0:28 [kjournald]
postgres  4270  0.0  0.1 120880  8304 ?        D    Sep17   0:00 postgres: writer process                                    
root      7549  0.0  0.0      0     0 ?        D    14:12   0:00 [pdflush]
postgres  7784  0.0  0.0 122060  4204 ?        D    14:13   0:00 postgres: nagiosxi nagiosxi 127.0.0.1(40864) UPDATE         
I haven't added many new hosts or services, so I suspect something has reached the point where it needs some cleanup and/or tuning. Any suggestions?
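Incidentally, the blocked count that vmstat reports can also be read straight from /proc; this one-liner (just an illustration) prints the same counter behind vmstat's "b" column:

```shell
# procs_blocked in /proc/stat is the counter vmstat shows in its "b" column:
# the number of tasks currently stuck in uninterruptible (D) sleep.
awk '/^procs_blocked/ {print "blocked:", $2}' /proc/stat
```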

Re: Nagios performance trouble

Posted: Tue Sep 18, 2012 1:42 pm
by slansing
Check your tmp directory for temporary check files:

Code: Select all

cd /tmp

ll
You may see an overabundance of files that look similar to this:

Code: Select all

check9283aZ0
Delete all of them by issuing this:

Code: Select all

rm -rf check*
Have you experienced any database related issues?
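If you'd rather not risk deleting temp files for checks that are still running, a more conservative sketch is to let find remove only the stale ones (the one-hour cutoff here is arbitrary):

```shell
# Delete only check* scratch files in /tmp older than an hour, so any
# currently running checks keep their temporary files.
find /tmp -maxdepth 1 -name 'check*' -mmin +60 -delete
```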

Re: Nagios performance trouble

Posted: Tue Sep 18, 2012 1:44 pm
by hhlodge
There were only 15 of those check files, and I've removed them. I'm not aware of any database issues, but I haven't looked for signs of trouble either.

Re: Nagios performance trouble

Posted: Tue Sep 18, 2012 1:50 pm
by slansing
If you tail the system log, are you seeing anything strange?

Code: Select all

tail /var/log/messages
I would also recommend checking this out:

http://assets.nagios.com/downloads/nagi ... rmance.pdf
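In particular, it may be worth filtering for the kernel's hung-task warnings, which appear when a process sits in uninterruptible sleep for over 120 seconds. Something like:

```shell
# Pull the kernel's hung-task warnings out of the syslog noise and show
# only the most recent few.
grep 'blocked for more than' /var/log/messages | tail -n 5
```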

Re: Nagios performance trouble

Posted: Tue Sep 18, 2012 2:10 pm
by hhlodge
This looks grim.

Code: Select all

Sep 18 02:14:46 psm-itmon nagios: Warning: Global service event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service 
--host="localhost" --service="Total Processes" --hostaddress="127.0.0.1" --hoststate=UP --hoststateid=0 --hosteventid=0 --hostproblemid=0 --servicestate=WARNING --servicesta
teid=1 --lastservicestate=WARNING --lastservicestateid=1 --servicestatetype=HARD --currentattempt=4 --maxattempts=4 --serviceeventid=200278 --serviceproblemid=90283 --servic
eoutput="PROCS WARNING: 296 processes with STATE = RSZDT" --longserviceoutput=""' timed out after 30 seconds
Sep 18 02:14:46 psm-itmon nagios: SERVICE ALERT: BTMSPROJECT;Memory Usage;WARNING;HARD;3;WARNING - [Triggered by _MemUsed%>80] - Physical Memory: Total: 2.999GB - Used: 2.50
2GB (83%) - Free: 0.497GB (17%) 
Sep 18 02:15:02 psm-itmon kernel: INFO: task postmaster:4270 blocked for more than 120 seconds.
Sep 18 02:15:02 psm-itmon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 18 02:15:02 psm-itmon kernel: postmaster    D ffff810001015120     0  4270   4186          4271  4252 (NOTLB)
Sep 18 02:15:02 psm-itmon kernel:  ffff81021a76fd88 0000000000000086 0000000000000296 0000000000000003
Sep 18 02:15:02 psm-itmon kernel:  ffff81021a76fd18 0000000000000009 ffff81021bb65040 ffff81022ff26080
Sep 18 02:15:02 psm-itmon kernel:  0000384bdb78ada6 00000000000007bc ffff81021bb65228 000000027ff25f90
Sep 18 02:15:02 psm-itmon kernel: Call Trace:
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff8002fe49>] __writeback_single_inode+0x1d9/0x318
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff8001ebc1>] __dentry_open+0x101/0x1dc
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff800f5a1a>] sync_inode+0x24/0x33
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff8804c370>] :ext3:ext3_sync_file+0xcc/0xf8
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff800504d1>] do_fsync+0x52/0xa4
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff800e3463>] __do_fsync+0x23/0x36
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Sep 18 02:15:02 psm-itmon kernel: 
Sep 18 02:15:02 psm-itmon kernel: INFO: task postmaster:1760 blocked for more than 120 seconds.
Sep 18 02:15:02 psm-itmon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 18 02:15:02 psm-itmon kernel: postmaster    D ffff81000100caa0     0  1760   4186          1770 32690 (NOTLB)
Sep 18 02:15:02 psm-itmon kernel:  ffff810074231ea8 0000000000000086 00000000ffffffd8 ffffffff8001264b
Sep 18 02:15:02 psm-itmon kernel:  800000019cc16067 000000000000000a ffff8100703c1100 ffff81022ff17100
Sep 18 02:15:02 psm-itmon kernel:  00003856c6672660 0000000000050eb3 ffff8100703c12e8 000000018002232e
Sep 18 02:15:02 psm-itmon kernel: Call Trace:
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff8001264b>] may_open+0x65/0x22f
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff80013eef>] generic_file_llseek+0x2a/0x8b
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff80025707>] sys_lseek+0x40/0x60
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff8005d229>] tracesys+0x71/0xe0
Sep 18 02:15:02 psm-itmon kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
And my iowait sar data up to this time looks increasingly dire.

Code: Select all

08:10:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
08:20:01 AM       all      5.31      0.00      0.86      0.02      0.00     93.82
08:30:01 AM       all      5.29      0.00      0.83      0.02      0.00     93.87
08:40:01 AM       all      5.30      0.00      0.87      0.02      0.00     93.81
08:50:01 AM       all      6.34      0.00      0.93      0.02      0.00     92.71
09:00:01 AM       all      6.60      0.00      0.87      0.02      0.00     92.51
09:10:01 AM       all      7.18      0.14      0.96      0.02      0.00     91.70
09:20:01 AM       all      7.22      0.00      0.94      0.02      0.00     91.82
09:30:01 AM       all      6.82      0.00      0.88      0.02      0.00     92.28
09:40:01 AM       all      6.80      0.00      0.93      0.02      0.00     92.25
09:50:01 AM       all      7.48      0.00      0.98      0.02      0.00     91.52
10:00:01 AM       all      7.07      0.00      0.88      0.02      0.00     92.03
10:10:01 AM       all      7.16      0.14      0.94      0.02      0.00     91.73
10:20:01 AM       all      7.08      0.00      0.93      0.02      0.00     91.97
10:30:01 AM       all      7.11      0.00      0.89      0.02      0.00     91.98
10:40:01 AM       all      7.26      0.00      0.94      0.02      0.00     91.78
10:50:01 AM       all      7.28      0.00      0.93      0.03      0.00     91.76
11:00:01 AM       all      7.13      0.00      0.89      0.02      0.00     91.97
11:10:01 AM       all      7.27      0.14      0.94      0.02      0.00     91.62
11:20:01 AM       all      7.13      0.00      0.91      0.02      0.00     91.93
11:30:01 AM       all      7.01      0.00      0.87      0.02      0.00     92.10
11:40:01 AM       all      7.15      0.00      0.93      0.02      0.00     91.90
11:50:01 AM       all      6.30      0.00      0.89      0.02      0.00     92.79
12:00:01 PM       all      6.43      0.00      0.87      0.02      0.00     92.68
12:10:01 PM       all      6.90      0.16      0.94      0.02      0.00     91.98
12:20:01 PM       all      6.94      0.00      0.92      0.02      0.00     92.12
12:30:01 PM       all      6.50      0.00      0.77     30.73      0.00     62.00
12:40:01 PM       all      6.56      0.00      0.81     25.48      0.00     67.14
12:50:01 PM       all      6.78      0.00      0.83     29.41      0.00     62.97
01:00:01 PM       all      6.47      0.00      0.79     20.80      0.00     71.94
01:10:01 PM       all      6.68      0.14      0.82     30.02      0.00     62.34
01:20:01 PM       all      6.62      0.00      0.81     27.91      0.00     64.65
01:30:01 PM       all      6.08      0.00      0.64     29.25      0.00     64.02
01:40:01 PM       all      6.61      0.00      0.81     30.01      0.00     62.58
01:50:01 PM       all      5.94      0.00      0.78     28.15      0.00     65.13
02:00:01 PM       all      6.17      0.00      0.82     26.50      0.00     66.51
02:10:01 PM       all      5.89      0.14      0.63     25.18      0.00     68.16
02:20:01 PM       all      5.13      0.00      0.64     29.71      0.00     64.51
02:30:01 PM       all      5.49      0.00      0.60     40.94      0.00     52.97
Average:          all      5.70      0.02      0.81     16.76      0.00     76.71
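A quick way to flag the intervals where %iowait (column 7 in the layout above) crosses a threshold, assuming the sar output has been saved to a file; sar_cpu.txt here is a hypothetical filename:

```shell
# Print only the sar intervals where %iowait (7th column of "sar -u" output)
# exceeds 10%; sar_cpu.txt is a hypothetical file holding saved sar output.
cat sar_cpu.txt 2>/dev/null | awk '$7+0 > 10 {print $1, $2, "iowait=" $7}'
```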

Re: Nagios performance trouble

Posted: Tue Sep 18, 2012 2:22 pm
by slansing
Take a look at the following and run the postgres vacuum:
Postgresql: Postmaster CPU Is High or "Transaction wraparound limit" in log

Although Nagios XI performs routine database maintenance on the postgres data tables, if you notice either high CPU usage from the postmaster process, or a repeated error message in the /var/lib/pgsql/data/pg_log file that says "transaction ID wrap limit is 2147484146", then you may need to perform a manual VACUUM of the postgres databases. Run the following commands from the command line:

psql nagiosxi nagiosxi
VACUUM;
VACUUM ANALYZE;
VACUUM FULL;
\q
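One note: if VACUUM prints warnings about skipping system catalogs such as pg_authid, that just means the nagiosxi role lacks superuser rights on those shared tables. Running vacuumdb as the postgres OS user (assuming a stock package install) covers them as well:

```shell
# Vacuum and analyze every database as the postgres superuser, which can
# also touch the shared system catalogs the nagiosxi role cannot.
su - postgres -c 'vacuumdb --all --analyze'
```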

Afterwards, could you post the output of the following two commands separately?

Code: Select all

df -h

tail /var/log/mysqld.log

Re: Nagios performance trouble

Posted: Tue Sep 18, 2012 2:27 pm
by hhlodge

Code: Select all

nagiosxi=> VACUUM;
WARNING:  skipping "pg_authid" --- only table or database owner can vacuum it

WARNING:  skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING:  skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING:  skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING:  skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING:  skipping "pg_database" --- only table or database owner can vacuum it
VACUUM
nagiosxi=> 
nagiosxi=> VACUUM ANALYZE;
WARNING:  skipping "pg_authid" --- only table or database owner can vacuum it
WARNING:  skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING:  skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING:  skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING:  skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING:  skipping "pg_database" --- only table or database owner can vacuum it
VACUUM
nagiosxi=> VACUUM FULL;
WARNING:  skipping "pg_authid" --- only table or database owner can vacuum it
WARNING:  skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING:  skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING:  skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING:  skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING:  skipping "pg_database" --- only table or database owner can vacuum it
VACUUM
nagiosxi=> \q

Code: Select all

 [root@psm-itmon tmp]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       57G  6.1G   48G  12% /
/dev/mapper/VolGroup01-LogVol00
                      133G   32G   95G  26% /usr/local
/dev/cciss/c0d0p1      99M   20M   75M  21% /boot
tmpfs                 3.9G     0  3.9G   0% /dev/shm

Code: Select all

[root@psm-itmon tmp]# tail /var/log/mysqld.log
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:53:24 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:53:24 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:53:24 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:53:24 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it

Re: Nagios performance trouble

Posted: Tue Sep 18, 2012 2:34 pm
by slansing
It looks like you may want to run a database repair:

http://assets.nagios.com/downloads/nagi ... tabase.pdf

Code: Select all

service mysqld stop

/usr/local/nagiosxi/scripts/repairmysql.sh nagios

service mysqld start
Then run the tail on mysqld.log again and post the output here.
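If the bundled script gives you trouble, a rough equivalent (credentials are a guess; adjust to your setup) is to repair the MyISAM tables with mysqlcheck while mysqld is running:

```shell
# Repair all tables in the "nagios" database in place; for MyISAM tables
# like nagios_servicestatus this maps to REPAIR TABLE.
mysqlcheck --repair --databases nagios -u root -p
```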

Re: Nagios performance trouble

Posted: Tue Sep 18, 2012 2:37 pm
by hhlodge
Done.

Code: Select all

root@psm-itmon nagios]# tail /var/log/mysqld.log
120918 15:37:05  InnoDB: Starting shutdown...
120918 15:37:06  InnoDB: Shutdown completed; log sequence number 0 43655
120918 15:37:06 [Note] /usr/libexec/mysqld: Shutdown complete

120918 15:37:06  mysqld ended

120918 15:37:06  mysqld started
120918 15:37:06  InnoDB: Started; log sequence number 0 43655
120918 15:37:06 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.0.77'  socket: '/usr/local/var/lib/mysql/mysql.sock'  port: 3306  Source distribution

Re: Nagios performance trouble

Posted: Tue Sep 18, 2012 2:44 pm
by slansing
How is your CPU load looking now?