Page 1 of 5
Nagios performance trouble
Posted: Tue Sep 18, 2012 1:21 pm
by hhlodge
Hi,
Over the past couple weeks I've been getting more and more notifications for CPU load on my XI server. It's physical server with CentOS 5.6 64 bit with XI 2011R3.2. Now it's just starting to perform badly in the web interfaces and I have php commands timing out in the logs. I can see loads in the 8-10 range often now and vmstat shows many processes blocked. I've got 8GB of RAM and 1.5GB unused so it's not swapping. Something is suddenly using a lot of IO. In top, httpd and mysqld are usually the top CPU users. Based on some reading, this may show what processes are blocking:
Code: Select all
# ps aux | grep " [D]"
root 2349 0.0 0.0 0 0 ? D< Sep17 0:28 [kjournald]
postgres 4270 0.0 0.1 120880 8304 ? D Sep17 0:00 postgres: writer process
root 7549 0.0 0.0 0 0 ? D 14:12 0:00 [pdflush]
postgres 7784 0.0 0.0 122060 4204 ? D 14:13 0:00 postgres: nagiosxi nagiosxi 127.0.0.1(40864) UPDATE
I haven't added many new hosts or services so I suspect something is getting to where it needs some cleanup and/or tuning. Any suggestions?
Re: Nagios performance trouble
Posted: Tue Sep 18, 2012 1:42 pm
by slansing
Check your tmp directory for temporary check files:
You may see an over abundance of files that look similar to this:
Delete all of them by issuing this:
Have you experienced any database related issues?
Re: Nagios performance trouble
Posted: Tue Sep 18, 2012 1:44 pm
by hhlodge
There was only 15 of those check files and I've removed them. I am not aware of any database issues but I've not looked for any signs of trouble either.
Re: Nagios performance trouble
Posted: Tue Sep 18, 2012 1:50 pm
by slansing
if you tail the system log are you seeing anything strange?:
I would also recommend checking this out:
http://assets.nagios.com/downloads/nagi ... rmance.pdf
Re: Nagios performance trouble
Posted: Tue Sep 18, 2012 2:10 pm
by hhlodge
This looks grim.
Code: Select all
Sep 18 02:14:46 psm-itmon nagios: Warning: Global service event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service
--host="localhost" --service="Total Processes" --hostaddress="127.0.0.1" --hoststate=UP --hoststateid=0 --hosteventid=0 --hostproblemid=0 --servicestate=WARNING --servicesta
teid=1 --lastservicestate=WARNING --lastservicestateid=1 --servicestatetype=HARD --currentattempt=4 --maxattempts=4 --serviceeventid=200278 --serviceproblemid=90283 --servic
eoutput="PROCS WARNING: 296 processes with STATE = RSZDT" --longserviceoutput=""' timed out after 30 seconds
Sep 18 02:14:46 psm-itmon nagios: SERVICE ALERT: BTMSPROJECT;Memory Usage;WARNING;HARD;3;WARNING - [Triggered by _MemUsed%>80] - Physical Memory: Total: 2.999GB - Used: 2.50
2GB (83%) - Free: 0.497GB (17%)
Sep 18 02:15:02 psm-itmon kernel: INFO: task postmaster:4270 blocked for more than 120 seconds.
Sep 18 02:15:02 psm-itmon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 18 02:15:02 psm-itmon kernel: postmaster D ffff810001015120 0 4270 4186 4271 4252 (NOTLB)
Sep 18 02:15:02 psm-itmon kernel: ffff81021a76fd88 0000000000000086 0000000000000296 0000000000000003
Sep 18 02:15:02 psm-itmon kernel: ffff81021a76fd18 0000000000000009 ffff81021bb65040 ffff81022ff26080
Sep 18 02:15:02 psm-itmon kernel: 0000384bdb78ada6 00000000000007bc ffff81021bb65228 000000027ff25f90
Sep 18 02:15:02 psm-itmon kernel: Call Trace:
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff88036d8a>] :jbd:log_wait_commit+0xa3/0xf5
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff8002fe49>] __writeback_single_inode+0x1d9/0x318
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff8001ebc1>] __dentry_open+0x101/0x1dc
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff800f5a1a>] sync_inode+0x24/0x33
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff8804c370>] :ext3:ext3_sync_file+0xcc/0xf8
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff800504d1>] do_fsync+0x52/0xa4
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff800e3463>] __do_fsync+0x23/0x36
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Sep 18 02:15:02 psm-itmon kernel:
Sep 18 02:15:02 psm-itmon kernel: INFO: task postmaster:1760 blocked for more than 120 seconds.
Sep 18 02:15:02 psm-itmon kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 18 02:15:02 psm-itmon kernel: postmaster D ffff81000100caa0 0 1760 4186 1770 32690 (NOTLB)
Sep 18 02:15:02 psm-itmon kernel: ffff810074231ea8 0000000000000086 00000000ffffffd8 ffffffff8001264b
Sep 18 02:15:02 psm-itmon kernel: 800000019cc16067 000000000000000a ffff8100703c1100 ffff81022ff17100
Sep 18 02:15:02 psm-itmon kernel: 00003856c6672660 0000000000050eb3 ffff8100703c12e8 000000018002232e
Sep 18 02:15:02 psm-itmon kernel: Call Trace:
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff8001264b>] may_open+0x65/0x22f
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff80013eef>] generic_file_llseek+0x2a/0x8b
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff80025707>] sys_lseek+0x40/0x60
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff8005d229>] tracesys+0x71/0xe0
Sep 18 02:15:02 psm-itmon kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
And my iowait sar data up to this time looks increasingly dire.
Code: Select all
08:10:01 AM CPU %user %nice %system %iowait %steal %idle
08:20:01 AM all 5.31 0.00 0.86 0.02 0.00 93.82
08:30:01 AM all 5.29 0.00 0.83 0.02 0.00 93.87
08:40:01 AM all 5.30 0.00 0.87 0.02 0.00 93.81
08:50:01 AM all 6.34 0.00 0.93 0.02 0.00 92.71
09:00:01 AM all 6.60 0.00 0.87 0.02 0.00 92.51
09:10:01 AM all 7.18 0.14 0.96 0.02 0.00 91.70
09:20:01 AM all 7.22 0.00 0.94 0.02 0.00 91.82
09:30:01 AM all 6.82 0.00 0.88 0.02 0.00 92.28
09:40:01 AM all 6.80 0.00 0.93 0.02 0.00 92.25
09:50:01 AM all 7.48 0.00 0.98 0.02 0.00 91.52
10:00:01 AM all 7.07 0.00 0.88 0.02 0.00 92.03
10:10:01 AM all 7.16 0.14 0.94 0.02 0.00 91.73
10:20:01 AM all 7.08 0.00 0.93 0.02 0.00 91.97
10:30:01 AM all 7.11 0.00 0.89 0.02 0.00 91.98
10:40:01 AM all 7.26 0.00 0.94 0.02 0.00 91.78
10:50:01 AM all 7.28 0.00 0.93 0.03 0.00 91.76
11:00:01 AM all 7.13 0.00 0.89 0.02 0.00 91.97
11:10:01 AM all 7.27 0.14 0.94 0.02 0.00 91.62
11:20:01 AM all 7.13 0.00 0.91 0.02 0.00 91.93
11:30:01 AM all 7.01 0.00 0.87 0.02 0.00 92.10
11:40:01 AM all 7.15 0.00 0.93 0.02 0.00 91.90
11:50:01 AM all 6.30 0.00 0.89 0.02 0.00 92.79
12:00:01 PM all 6.43 0.00 0.87 0.02 0.00 92.68
12:10:01 PM all 6.90 0.16 0.94 0.02 0.00 91.98
12:20:01 PM all 6.94 0.00 0.92 0.02 0.00 92.12
12:30:01 PM all 6.50 0.00 0.77 30.73 0.00 62.00
12:40:01 PM all 6.56 0.00 0.81 25.48 0.00 67.14
12:50:01 PM all 6.78 0.00 0.83 29.41 0.00 62.97
01:00:01 PM all 6.47 0.00 0.79 20.80 0.00 71.94
01:10:01 PM all 6.68 0.14 0.82 30.02 0.00 62.34
01:20:01 PM all 6.62 0.00 0.81 27.91 0.00 64.65
01:30:01 PM all 6.08 0.00 0.64 29.25 0.00 64.02
01:40:01 PM all 6.61 0.00 0.81 30.01 0.00 62.58
01:50:01 PM all 5.94 0.00 0.78 28.15 0.00 65.13
02:00:01 PM all 6.17 0.00 0.82 26.50 0.00 66.51
02:10:01 PM all 5.89 0.14 0.63 25.18 0.00 68.16
02:20:01 PM all 5.13 0.00 0.64 29.71 0.00 64.51
02:30:01 PM all 5.49 0.00 0.60 40.94 0.00 52.97
Average: all 5.70 0.02 0.81 16.76 0.00 76.71
Re: Nagios performance trouble
Posted: Tue Sep 18, 2012 2:22 pm
by slansing
Take a look at and run the postgres vaccum:
Postgresql: Postmaster CPU Is High or "Transaction wraparound limit" in log
Although Nagios XI performance routine database maintenance on the postgres data tables, if you notice either a high CPU usage for the postmaster process, or a repeated error message in the /var/lib/pgsql/data/pg_log file that says "transaction ID wrap limit is 2147484146", then you may need to perform a manual VACUUM of the postgres databases. Run the following commands from the command line:
psql nagiosxi nagiosxi
VACUUM;
VACUUM ANALYZE;
VACUUM FULL;
\q
After which could you post the output of the two following commands separately?:
Re: Nagios performance trouble
Posted: Tue Sep 18, 2012 2:27 pm
by hhlodge
Code: Select all
nagiosxi=> VACUUM;
WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
WARNING: skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING: skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING: skipping "pg_database" --- only table or database owner can vacuum it
VACUUM
nagiosxi=>
nagiosxi=> VACUUM ANALYZE;
WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
WARNING: skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING: skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING: skipping "pg_database" --- only table or database owner can vacuum it
VACUUM
nagiosxi=> VACUUM FULL;
WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
WARNING: skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING: skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING: skipping "pg_database" --- only table or database owner can vacuum it
VACUUM
nagiosxi=> \q
Code: Select all
[root@psm-itmon tmp]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
57G 6.1G 48G 12% /
/dev/mapper/VolGroup01-LogVol00
133G 32G 95G 26% /usr/local
/dev/cciss/c0d0p1 99M 20M 75M 21% /boot
tmpfs 3.9G 0 3.9G 0% /dev/shm
Code: Select all
[root@psm-itmon tmp]# tail /var/log/mysqld.log
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:45:44 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:53:24 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:53:24 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:53:24 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
120918 14:53:24 [ERROR] /usr/libexec/mysqld: Incorrect key file for table './nagios/nagios_servicestatus.MYI'; try to repair it
Re: Nagios performance trouble
Posted: Tue Sep 18, 2012 2:34 pm
by slansing
It looks like you may want to run a database repair:
http://assets.nagios.com/downloads/nagi ... tabase.pdf
Code: Select all
service mysqld stop
/usr/local/nagiosxi/scripts/repairmysql.sh naigos
service mysqld start
And then run the tail on mysqld again and post the output here.
Re: Nagios performance trouble
Posted: Tue Sep 18, 2012 2:37 pm
by hhlodge
Done.
Code: Select all
root@psm-itmon nagios]# tail /var/log/mysqld.log
120918 15:37:05 InnoDB: Starting shutdown...
120918 15:37:06 InnoDB: Shutdown completed; log sequence number 0 43655
120918 15:37:06 [Note] /usr/libexec/mysqld: Shutdown complete
120918 15:37:06 mysqld ended
120918 15:37:06 mysqld started
120918 15:37:06 InnoDB: Started; log sequence number 0 43655
120918 15:37:06 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.0.77' socket: '/usr/local/var/lib/mysql/mysql.sock' port: 3306 Source distribution
Re: Nagios performance trouble
Posted: Tue Sep 18, 2012 2:44 pm
by slansing
How is your CPU load looking now?