System is slow, CPU usage skyhigh

r.jaynes · Post by **r.jaynes** » Tue Apr 19, 2011 4:13 pm

Recently I upgraded our server to 2011R1.1, and I'm experiencing a high CPU load as well. Also, I'll see "localhost" in nagios with a critical error for the load times, and sometimes the SMTP process will report as not responding. This never occurred in the 2009 versions that I can recall. Also, browsing the Nagios web interface is affected.

Looking at my process list, there are quite a few "postmaster" entries, but they are all at 0% CPU usage:

Current load average: 3.16, 4.87, 6.55
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
689 postgres 17 0 21900 3856 3104 S 0.0 0.7 0:00.02 postmaster
1198 postgres 15 0 21900 3812 3060 S 0.0 0.7 0:00.01 postmaster
1204 postgres 15 0 21900 3924 3152 S 0.0 0.8 0:00.01 postmaster
1208 postgres 15 0 21900 3804 3052 S 0.0 0.7 0:00.01 postmaster
1212 postgres 15 0 21900 4500 3664 S 0.3 0.9 0:00.04 postmaster
1218 postgres 15 0 21900 4220 3444 S 0.0 0.8 0:00.02 postmaster
1223 postgres 15 0 21900 4240 3424 S 0.0 0.8 0:00.01 postmaster
1287 postgres 16 0 21900 2864 2228 S 0.0 0.6 0:00.00 postmaster
2120 postgres 15 0 21228 1648 1524 S 0.0 0.3 3:17.78 postmaster
2186 postgres 15 0 11008 416 364 S 0.0 0.1 0:04.07 postmaster
2190 postgres 15 0 21228 3032 2888 S 0.0 0.6 0:41.71 postmaster
2191 postgres 16 0 12008 360 276 S 0.0 0.1 0:52.40 postmaster
2192 postgres 15 0 11188 584 384 S 0.0 0.1 0:52.65 postmaster
4668 postgres 15 0 21900 4196 3656 S 0.0 0.8 0:01.21 postmaster
4675 postgres 15 0 21900 4188 3652 S 0.0 0.8 0:01.23 postmaster
5262 postgres 15 0 21900 4752 4048 S 0.0 0.9 0:01.25 postmaster
5276 postgres 15 0 21900 4768 4056 S 0.0 0.9 0:01.20 postmaster
5287 postgres 17 0 21900 4752 4044 S 0.0 0.9 0:01.23 postmaster
5302 postgres 22 0 21900 4772 4064 S 0.0 0.9 0:01.21 postmaster
5330 postgres 15 0 21900 4820 4112 S 0.0 0.9 0:01.14 postmaster
5352 postgres 15 0 21900 4736 4032 S 0.0 0.9 0:01.19 postmaster
5360 postgres 22 0 21900 4796 4088 S 0.0 0.9 0:01.24 postmaster
5392 postgres 15 0 21900 4780 4068 S 0.0 0.9 0:01.30 postmaster
11043 postgres 23 0 21900 4836 4124 S 0.0 0.9 0:12.53 postmaster
11119 postgres 15 0 21900 4792 4080 S 0.0 0.9 0:12.57 postmaster
11234 postgres 17 0 21900 4944 4220 S 0.0 1.0 0:12.69 postmaster
11716 postgres 15 0 21900 4836 4120 S 0.0 0.9 0:12.72 postmaster
12044 postgres 15 0 21900 4852 4132 S 0.0 0.9 0:03.37 postmaster
13071 postgres 15 0 21900 4320 3776 S 0.0 0.8 0:13.02 postmaster
13206 postgres 15 0 21900 4388 3760 S 0.0 0.9 0:12.62 postmaster
16236 postgres 15 0 21900 4800 4084 S 0.0 0.9 0:09.01 postmaster
17396 postgres 15 0 21900 4832 4120 S 0.0 0.9 0:05.38 postmaster
27995 postgres 15 0 21900 5000 4280 S 0.0 1.0 0:03.96 postmaster
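
For anyone who wants to total up a listing like the one above rather than eyeball it, here is a minimal Python sketch. It assumes the default `top` column order shown in the header line (%CPU in the ninth column, COMMAND in the twelfth). The function name `cpu_by_command` is made up for illustration, and the sample rows are three lines copied from the listing.

```python
from collections import defaultdict

def cpu_by_command(top_lines):
    """Count processes and sum the %CPU column per command name.

    Assumes the default top column order:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in top_lines:
        fields = line.split()
        # Skip the header, blank lines, and anything that is not a process row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        command = fields[11]
        totals[command] += float(fields[8])
        counts[command] += 1
    return {cmd: (counts[cmd], round(totals[cmd], 1)) for cmd in totals}

# Three rows copied from the listing above, plus the header line.
sample = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "689 postgres 17 0 21900 3856 3104 S 0.0 0.7 0:00.02 postmaster",
    "1212 postgres 15 0 21900 4500 3664 S 0.3 0.9 0:00.04 postmaster",
    "2120 postgres 15 0 21228 1648 1524 S 0.0 0.3 3:17.78 postmaster",
]
print(cpu_by_command(sample))
```

Keep in mind that a single `top` snapshot only shows the instant it was taken, so short bursts of CPU use can be missed entirely.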

r.jaynes · Post by **r.jaynes** » Tue Apr 19, 2011 4:19 pm

I ran "ps aux | grep dbmaint" and didn't see that process running either. I then ran it manually, and the third line below (the add_mibdir message) appeared. I've seen this before when we were adding MIBs for the SNMP trap service. Could our number of MIBs be slowing the server down so much?

[root@monitor ~]# /usr/local/nagiosxi/cron/dbmaint.php
No log handling enabled - turning on stderr logging
add_mibdir: strings scanned in from /usr/share/snmp/mibs/.index are too large. count = 143
CLEANING ndoutils TABLE 'commenthistory'...
SQL: DELETE FROM nagios_commenthistory WHERE entry_time < FROM_UNIXTIME(1271711716)
CLEANING ndoutils TABLE 'processevents'...
SQL: DELETE FROM nagios_processevents WHERE event_time < FROM_UNIXTIME(1271711716)
CLEANING ndoutils TABLE 'externalcommands'...
SQL: DELETE FROM nagios_externalcommands WHERE entry_time < FROM_UNIXTIME(1302642916)
CLEANING ndoutils TABLE 'logentries'...
SQL: DELETE FROM nagios_logentries WHERE logentry_time < FROM_UNIXTIME(1295471716)
CLEANING ndoutils TABLE 'notifications'...
SQL: DELETE FROM nagios_notifications WHERE start_time < FROM_UNIXTIME(1295471716)
CLEANING ndoutils TABLE 'contactnotifications'...
SQL: DELETE FROM nagios_contactnotifications WHERE start_time < FROM_UNIXTIME(1295471716)
CLEANING ndoutils TABLE 'contactnotificationmethods'...
SQL: DELETE FROM nagios_contactnotificationmethods WHERE start_time < FROM_UNIXTIME(1295471716)
CLEANING ndoutils TABLE 'statehistory'...
SQL: DELETE FROM nagios_statehistory WHERE state_time < FROM_UNIXTIME(1240175716)
CLEANING ndoutils TABLE 'timedevents'...
SQL: DELETE FROM nagios_timedevents WHERE event_time < FROM_UNIXTIME(1303247416)
CLEANING ndoutils TABLE 'systemcommands'...
SQL: DELETE FROM nagios_systemcommands WHERE start_time < FROM_UNIXTIME(1303247416)
CLEANING ndoutils TABLE 'servicechecks'...
SQL: DELETE FROM nagios_servicechecks WHERE start_time < FROM_UNIXTIME(1303247416)
CLEANING ndoutils TABLE 'hostchecks'...
SQL: DELETE FROM nagios_hostchecks WHERE start_time < FROM_UNIXTIME(1303247416)
CLEANING ndoutils TABLE 'eventhandlers'...
SQL: DELETE FROM nagios_eventhandlers WHERE start_time < FROM_UNIXTIME(1303247416)
LASTOPT: 1303246524
INTERVAL: 60
NOW: 1303247716
OPTTIME: 1303250124
CLEANING nagiosxi TABLE 'commands'...
SQL: DELETE FROM xi_commands WHERE processing_time < 1303218916::abstime::timestamp without time zone
CLEANING nagiosxi TABLE 'events'...
SQL: DELETE FROM xi_events WHERE processing_time < 1303218916::abstime::timestamp without time zone
SQL1: SELECT xi_meta.meta_id FROM xi_meta LEFT JOIN xi_events ON xi_meta.metaobj_id=xi_events.event_id WHERE metatype_id='1' AND event_id IS NULL
SQL2: DELETE FROM xi_meta WHERE meta_id IN (SELECT xi_meta.meta_id FROM xi_meta LEFT JOIN xi_events ON xi_meta.metaobj_id=xi_events.event_id WHERE metatype_id='1' AND event_id IS NULL)
CLEANING nagiosql TABLE 'logbook'...
SQL: DELETE FROM tbl_logbook WHERE time < FROM_UNIXTIME(1303218916)
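
The cutoffs in the DELETE statements above are Unix timestamps, so subtracting each one from the "NOW:" value shows how much history each table is allowed to keep. A small Python sketch of that arithmetic follows. The timestamps are copied from the output above; the `retention` helper and the choice of which tables to list are only for illustration.

```python
from datetime import timedelta

NOW = 1303247716  # the "NOW:" value printed by dbmaint.php above

# Cutoff timestamps copied from the DELETE statements above.
cutoffs = {
    "commenthistory": 1271711716,
    "externalcommands": 1302642916,
    "logentries": 1295471716,
    "statehistory": 1240175716,
    "servicechecks": 1303247416,
    "xi_commands": 1303218916,
}

def retention(cutoff, now=NOW):
    """Return how far back from `now` a cutoff timestamp reaches."""
    return timedelta(seconds=now - cutoff)

for table, cutoff in cutoffs.items():
    print(f"{table:18} keeps the last {retention(cutoff)}")
```

Worked out, that is 365 days for commenthistory, 7 days for externalcommands, 90 days for logentries, 730 days for statehistory, 5 minutes for servicechecks, and 8 hours for xi_commands. Likewise, OPTTIME minus LASTOPT is 3600 seconds, which matches the "INTERVAL: 60" line if that value is in minutes.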

rdedon · Post by **rdedon** » Wed Apr 20, 2011 2:07 pm

I am wondering if this could be a case of the most recently configured MIBs not being recognized. Doing a little research, I am seeing reports of this nature.

Symfoni · Post by **Symfoni** » Mon May 02, 2011 7:21 am

Running dbmaint on our server produces the same kind of output as I copied into my post of January 6.
No mention of MIBs in the output, so the problem doesn't seem to be related to that.

There is a bit of disk i/o on the system, but since neither 'top' nor 'vmstat' reports any waiting on i/o (aside from just after a restart of the machine but it goes down to 0 after a little while), that doesn't seem to be the problem either.
'vmstat' does indicate a high number of processes "waiting for run time" though, which makes sense given that postmaster/postgres is hogging the cpu.
'vmstat' also says that the cpu share for "Time spent running non-kernel code" is around 70-80%, while "Time spent running kernel code" is about 20-30%. Time spent idling and waiting for I/O is zero.
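
The reading of 'vmstat' described above can be written down as a rough rule of thumb. The Python sketch below is only that: the function name `classify_cpu` and the thresholds (10% I/O wait, 5% idle) are assumptions chosen for illustration, and the sample numbers are midpoints of the ranges given above, with a made-up run-queue length since the exact figure was not posted.

```python
def classify_cpu(us, sy, idle, wa, runnable, cpus=1):
    """Give a rough reading of one vmstat sample.

    us, sy, idle, wa are the CPU percentages from vmstat's cpu columns;
    runnable is the 'r' column (processes waiting for run time).
    The thresholds are illustrative assumptions, not official guidance.
    """
    if wa >= 10:
        return "I/O-bound"
    if idle <= 5 and runnable > cpus:
        return "CPU-bound"
    return "not saturated"

# Midpoints of the ranges reported above; runnable=5 is a hypothetical value.
print(classify_cpu(us=75, sy=25, idle=0, wa=0, runnable=5))
```

With zero idle time, zero I/O wait, and more runnable processes than CPUs, this sample falls into the CPU-bound case, which agrees with the conclusion drawn in the post.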

We are still on version 2009R1.4B. I couldn't find any mention of postgres performance changes in the changelogs for the new 2011R1 and 2011R1.1 versions, only improvements in mysql, so I'm assuming our issue is quite uncommon and hasn't been resolved in the latest versions.

The items mentioned in http://wiki.postgresql.org/wiki/Priorities don't resolve our issue. There is no memory problem, nor is disk I/O a problem, which only leaves CPU. The method suggested to prioritize CPU (using 'nice') looks to me like something that needs to be done during the coding phase, and I'd prefer not to try to change the official code for nagiosxi in case I irreparably break something or destroy data.
Do you have any suggestions on how to further track down and resolve the problem?

mguthrie · Post by **mguthrie** » Mon May 02, 2011 10:26 am

Let's try running some queries against the postgresql database and see if anything stalls out. I suspect there is damage somewhere in the postgres database, but it's hard to say for sure. So far we haven't had this issue reported by anyone else and we haven't been able to replicate it, so it's hard to pinpoint exactly. Try running the queries below, and take note of any error messages, or of any queries that take more than 2 or 3 seconds.

psql nagiosxi nagiosxi
\d                                
select count(*) from xi_commands;
select count(*) from xi_events;
select count(*) from xi_meta;
select count(*) from xi_options;
select count(*) from xi_sysstat;
select count(*) from xi_usermeta;
select count(*) from xi_users;
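
If timing these by hand is awkward, the same check can be scripted. The Python sketch below is a hedged illustration, not part of Nagios XI: `time_queries` is a made-up helper that takes any function able to run a SQL string (for example one wrapping a cursor from a PostgreSQL driver) and flags queries slower than a threshold, here 3 seconds to match the "2 or 3 seconds" guideline. A stand-in executor is used so the sketch runs without a database.

```python
import time

# The count queries listed above.
QUERIES = [
    "select count(*) from xi_commands;",
    "select count(*) from xi_events;",
    "select count(*) from xi_meta;",
    "select count(*) from xi_options;",
    "select count(*) from xi_sysstat;",
    "select count(*) from xi_usermeta;",
    "select count(*) from xi_users;",
]

def time_queries(run_query, queries, slow_after=3.0):
    """Run each query through run_query and record its wall-clock time.

    Returns a list of (sql, outcome, seconds, is_slow) tuples.
    """
    results = []
    for sql in queries:
        start = time.perf_counter()
        try:
            outcome = run_query(sql)
        except Exception as exc:
            # Keep going so every table gets checked; record the error text.
            outcome = f"ERROR: {exc}"
        elapsed = time.perf_counter() - start
        results.append((sql, outcome, elapsed, elapsed > slow_after))
    return results

def fake_run(sql):
    """Stand-in executor (hypothetical); replace with a real database call."""
    return 0

for sql, outcome, seconds, is_slow in time_queries(fake_run, QUERIES):
    flag = "SLOW" if is_slow else "ok"
    print(f"{flag:4} {seconds:8.3f}s  {sql}  -> {outcome}")
```

Passing the executor in as a function keeps the timing logic independent of whichever database library is actually installed on the server.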

The maintenance and cleaning commands are below; you can try running these as well. You'll get some warnings about not having permissions on some of the built-in postgres tables (those are normal), but post any error messages that might imply table damage or corruption.

vacuum;
vacuum analyze;
vacuum full;

What would be your thoughts about setting up an alternate test system on a second box or VM? Your license covers a test install, production install, and DR/Backup install.

mtkaschools · Post by **mtkaschools** » Mon May 02, 2011 4:44 pm

I too can't seem to shake the local host current load from being flagged running Nagios XI 2011R1.2. I have about 300 hosts and 1250 services being monitored. We keep throwing more RAM and CPU at it, but it just sucks it all up and basically says 'thanks', then triggers that error again about the current load.

Frustrating!

Symfoni · Post by **Symfoni** » Wed May 04, 2011 9:00 am

I ran the commands you mentioned. Here are the results:

select count(*) from xi_commands; took one or two seconds, result 0 (1 row)
select count(*) from xi_events; took about 15-20 seconds, result 656 (1 row)
select count(*) from xi_meta; took nearly 30 seconds, results 719 (1 row)
select count(*) from xi_options; took maybe 1 second, results 38 (1 row)
select count(*) from xi_sysstat; took about 25 seconds, results 16 (1 row)
select count(*) from xi_usermeta; took less than 1 second, results 155 (1 row)
select count(*) from xi_users; took just under 10 seconds, results 5 (1 row)

vacuum;
WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
WARNING: skipping "pg_tablespace" --- only table or database owner can vacuum i
WARNING: skipping "pg_pltemplate" --- only table or database owner can vacuum i
WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum
WARNING: skipping "pg_database" --- only table or database owner can vacuum it
took ages to complete, about three-ish hours.

vacuum analyze;
WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
WARNING: skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING: skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING: skipping "pg_database" --- only table or database owner can vacuum it
VACUUM
only took a few moments, maybe a minute.

vacuum full;
WARNING: skipping "pg_authid" --- only table or database owner can vacuum it
WARNING: skipping "pg_tablespace" --- only table or database owner can vacuum it
WARNING: skipping "pg_pltemplate" --- only table or database owner can vacuum it
WARNING: skipping "pg_shdepend" --- only table or database owner can vacuum it
WARNING: skipping "pg_auth_members" --- only table or database owner can vacuum it
WARNING: skipping "pg_database" --- only table or database owner can vacuum it
VACUUM
only took a few moments, maybe a minute.

After running the first vacuum, the system seemed more responsive. After the last vacuum completed, I had a look at 'top' and could hardly see any postmaster/postgres processes when sorted by processor usage with PuTTY in fullscreen, let alone any that were using a double-digit percentage of CPU.
That makes me wonder whether scheduled vacuums have been running but not completing within an allotted amount of time, getting killed before they had a chance to clean up everything, and whether that snowballed into a need for several hours' worth of vacuuming.

It's only been running like this for a couple of hours now, so I'll let it run at least until tomorrow without making any changes to the setup, to see if the problem reappears or if it's been solved.

Have you seen any similar problems with other installations, and do you have any tips to prevent this from happening again in the future?
Or do you think we should upgrade nagiosxi to the latest version and create a cronjob that runs, say, every night or every week to vacuum postgres?
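
As a rough idea of what such a scheduled vacuum could look like, here is a minimal Python sketch built around PostgreSQL's `vacuumdb` command-line tool. It is an assumption-laden illustration, not how Nagios XI's own maintenance script works: the helper names are made up, the database and user names are taken from the `psql nagiosxi nagiosxi` command earlier in the thread, and authentication is assumed to be already set up for that user.

```python
import subprocess

def vacuum_command(dbname="nagiosxi", user="nagiosxi", analyze=True, full=False):
    """Build a vacuumdb command line as a list of arguments."""
    cmd = ["vacuumdb", "--dbname", dbname, "--username", user]
    if analyze:
        cmd.append("--analyze")  # also refresh planner statistics
    if full:
        cmd.append("--full")     # rewrite tables; takes exclusive locks
    return cmd

def run_vacuum(timeout_seconds=None, **options):
    """Run vacuumdb and return its exit code.

    timeout_seconds=None lets the vacuum run to completion instead of
    being cut off part-way through.
    """
    completed = subprocess.run(
        vacuum_command(**options), timeout=timeout_seconds, check=False
    )
    return completed.returncode

# Only print the command here; call run_vacuum() on the server to execute it.
print(" ".join(vacuum_command()))
```

Leaving the timeout unset is deliberate in this sketch, given the suspicion above that vacuums cut off by a time limit may never have caught up.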

mguthrie · Post by **mguthrie** » Wed May 04, 2011 9:13 am

mtkaschools wrote:
I too can't seem to shake the local host current load from being flagged running Nagios XI 2011R1.2. I have about 300 hosts and 1250 services being monitored. We keep throwing more RAM and CPU at it, but it just sucks it all up and basically says 'thanks', then triggers that error again about the current load.

mtkaschools, can you repost this in a new thread and give some details as to what hardware you're running on and what your CPU load runs at for the 5- and 15-minute averages?

mguthrie · Post by **mguthrie** » Wed May 04, 2011 9:18 am

Symfoni wrote:
Have you seen any similar problems with other installations, and have any tips to prevent this from happening again in the future?

We saw something like this in a few mysql tables, but this is the first time we've seen it in a postgresql table. We did some adjustments to our db maintenance script, but I'll look it over again, because what you're describing about the vacuum job timing out and starting up again is probably exactly what happened.

Symfoni wrote:
Or do you think we should upgrade nagiosxi to the latest version and create a cronjob that ran, say, every night or every week to vacuum postgres?

Always upgrade to the latest version as it becomes available. It will have the latest bug fixes and security updates. The cron job already exists, and it does have a maximum allowed time that it can run, which, as we're both guessing, is probably the reason for this. What version are you running? We've made some updates to that script in particular in our latest release, but I might make a few more mods to it just to make sure a situation like this doesn't occur again.

Symfoni · Post by **Symfoni** » Fri May 06, 2011 2:56 am

mguthrie wrote:
Symfoni wrote:
Have you seen any similar problems with other installations, and have any tips to prevent this from happening again in the future?
We saw something like this in a few mysql tables, but this is the first time we've seen it in a postgresql table. We did some adjustments to our db maintenance script, but I'll look it over again, because what you're describing about the vacuum job timing out and starting up again is probably exactly what happened.

Will the changes be in the next version/upgrade of NagiosXI, or will they be available as a fixpack before the release of the next version?

mguthrie wrote:
Symfoni wrote:
Or do you think we should upgrade nagiosxi to the latest version and create a cronjob that ran, say, every night or every week to vacuum postgres?
Always upgrade to the latest version as it becomes available. It will have the latest bug fixes and security updates. The cron job already exists, and it does have a maximum allowed time that it can run, which, as we're both guessing, is probably the reason for this. What version are you running? We've made some updates to that script in particular in our latest release, but I might make a few more mods to it just to make sure a situation like this doesn't occur again.

We were using version 2009R1.4B when we had the problem. After vacuuming a couple of times, postgres had almost stopped using any CPU at all, and as a result the system was much more responsive. The next day, I couldn't log in using ssh at all, while the web GUI worked just fine. Luckily I had webmin installed and used that to initiate a reboot, after which I could log in again using ssh. It had been a while since we installed the system, and a total reinstall was already being considered as a possible fix for the postgres problem, so the worst case if things went sour was no worse than what we had planned. We therefore yum updated CentOS to the latest release and updated NagiosXI to 2011R1.2. The system has been working well and stable since then, and the load average right now is 1.97, 2.47, 2.27, which is a huge improvement from just a couple of days ago. We will keep an eye on things in case postgres gets bogged down again, but so far it's looking very promising.
