System is slow, CPU usage skyhigh

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: System is slow, CPU usage skyhigh

Post by mguthrie »

We ran some tests a few days ago with a server that had 3000 checks, and we began opening browsers and watching the CPU usage. We did notice a fairly substantial performance hit on the server as more and more browser windows were opened. With 11 windows open, our CPU (single-core VM) was at about 85%.

We've put a config option into our upcoming release that lets the user back down the frequency of the AJAX requests from the browsers, which gave a pretty substantial improvement when several browser windows were open.

However, in reviewing your posts, there does still appear to be something not quite right with postgres, because mysql and httpd should almost always be the top processes. Let me pass this up once more and see if we can come up with any other suggestions for you.
admin
Site Admin
Posts: 256
Joined: Mon Oct 12, 2009 8:21 am

Re: System is slow, CPU usage skyhigh

Post by admin »

Hmm... I don't know of anything else offhand that we could test. Does anyone else know offhand what might be the issue?

Ethan Galstad
President
Symfoni
Posts: 17
Joined: Wed Jan 05, 2011 5:46 pm

Re: System is slow, CPU usage skyhigh

Post by Symfoni »

We have now updated to the latest version, 2009R1.4. The upgrade itself is working fine, but postmaster/postgres is still eating up a lot of CPU and slowing the whole system down.

A screenshot of 'top' from a few minutes ago, sorted by processor usage, showed 13 postmaster processes each using between 11.0% and 19.6%, totaling 176.7%, which divided over 2 cores is 88.35% of system CPU.
Memory usage for those processes is at 0.3% of total RAM each, and other processes rarely take more than 1% or so of RAM.
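
The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name cpu_share is made up for illustration, and it assumes one %CPU figure per line on stdin.

```shell
# cpu_share CORES: read one %CPU value per line on stdin and print
# "<sum> <sum divided by CORES>".
# The function name is hypothetical; it is not part of Nagios XI or PostgreSQL.
cpu_share() {
    awk -v cores="$1" '
        { sum += $1 }
        END { printf "%.1f %.2f\n", sum, sum / cores }
    '
}

# Possible live usage (assumes the processes are named "postmaster").
# Note that ps reports %CPU averaged over each process's lifetime,
# so the figures will not match the instantaneous values in top:
#   ps -C postmaster -o pcpu= | cpu_share "$(nproc)"
```

Feeding it the 13 per-process values from 'top' should give the same total and per-core share as quoted in the post.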

The weird thing is, when I took the screenshot, I had disabled http and https to this host in the company firewall, and had 'tail -f /var/log/httpd/access_log' running in another PuTTY window. I took the screenshot somewhere midway through a period of about half an hour where the only lines output by 'tail' were

Code:

127.0.0.1 - - [01/Feb/2011:11:17:07 +0100] "POST /nagiosxi/backend/ HTTP/1.1" 200 813 "-" "BinGet/1.00.A (http://www.bin-co.com/php/scripts/load/)"
every minute
and

Code:

127.0.0.1 - - [01/Feb/2011:11:09:27 +0100] "GET / HTTP/1.0" 200 2528 "-" "check_http/v2053 (nagios-plugins 1.4.13)"
every 5 minutes.
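
One way to confirm that nothing else is hitting the web server is to tally the access log by user agent. The helper below is a rough sketch; the name requests_by_agent is invented here, and it assumes the combined log format shown in the two lines above, where the user agent is the last quoted field.

```shell
# requests_by_agent: read combined-format access log lines on stdin and
# print "<count><TAB><user agent>" per agent, busiest first.
# Hypothetical helper name; assumes the user agent is the 6th field
# when each line is split on double quotes.
requests_by_agent() {
    awk -F'"' '
        { hits[$6]++ }
        END { for (agent in hits) printf "%d\t%s\n", hits[agent], agent }
    ' | sort -rn
}

# Possible usage:
#   requests_by_agent < /var/log/httpd/access_log
```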

I checked the total amount of RAM used with 'free -m', and it showed about 3.5 GB of the 4 GB installed on the system being in use.
I thought this was a bit too much, so I checked /proc/sys/vm/drop_caches, which was set to '0'. I then ran 'sync' and changed the value in /proc/sys/vm/drop_caches from '0' to '3' to drop the page cache, dentries and inodes and free up some RAM.
Immediately following the change, CPU usage dropped to hardly anything, and the output of 'free -m' showed RAM usage climbing from a low of about 300-400 MB to about 2200-2400 MB. During that climb, postmaster hardly showed up at all in the 'top' output, but the web GUI for Nagios XI was slower than ever. What looked like regular HTML worked fine, but AJAX elements just showed the "hourglass". After a few minutes the amount of RAM used had reached 2400 MB and more or less settled between 2200 and 2400 MB, creeping slightly above 2400 from time to time.
Then the postmaster processes began popping up again in the 'top' output, taking the top spots for CPU usage. At the same time, the web GUI started responding again and actually showing AJAX elements. However, it was (and still is) far too slow.
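
For context, writing to /proc/sys/vm/drop_caches discards clean cached data once and the kernel then refills the cache as files are read again, which would fit the climb described above. The "used" figure from 'free -m' on older systems also includes buffers and page cache. A minimal sketch for reading memory use without the cache follows; the helper name is hypothetical, and it assumes the usual MemTotal, MemFree, Buffers and Cached fields (in kB) of /proc/meminfo.

```shell
# used_excluding_cache_mb [FILE]: print used memory in MB, not counting
# buffers and page cache, from a meminfo-style file.
# Hypothetical helper; defaults to /proc/meminfo, values there are in kB.
used_excluding_cache_mb() {
    awk '
        $1 == "MemTotal:" { total   = $2 }
        $1 == "MemFree:"  { free    = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached  = $2 }
        END { printf "%d\n", (total - free - buffers - cached) / 1024 }
    ' "${1:-/proc/meminfo}"
}

# Possible usage:
#   used_excluding_cache_mb
```

If this number stays low while 'free -m' reports most of the RAM as used, the memory is mostly cache and is unlikely to be the cause of the slowdown.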

I found it a little odd that postmaster was going full steam ahead even though there were no user-generated HTTP requests, but I hope you found some useful info above.
rdedon
Posts: 578
Joined: Sat Nov 20, 2010 4:51 pm

Re: System is slow, CPU usage skyhigh

Post by rdedon »

This type of information is extremely useful for us, especially the steps you took for attempting a work-around.
Rene deDon
Technical Team
___
Nagios Enterprises, LLC
Web: http://www.nagios.com
Symfoni
Posts: 17
Joined: Wed Jan 05, 2011 5:46 pm

Re: System is slow, CPU usage skyhigh

Post by Symfoni »

We are still experiencing crazy loads from postmaster/postgres. Do you have any further ideas what may be the problem or cause?

Here's a screenshot of a 'top' from the machine, sorted by %CPU, in case you have use for it.
topP.png
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: System is slow, CPU usage skyhigh

Post by mguthrie »

Can you go to the following directory and look through the postgresql logs to see if there's anything telling in there? My guess is that there could be some table corruption or something similar that's making a single connection hang, and then the rest just hang as well.

cd /var/lib/pgsql/data/pg_log

There will be several log files in there for each day. Feel free to post anything that you feel might point us in the right direction. I'm looking up some docs on checking and repairing postgresql tables.
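
If the logs are long, a quick tally of distinct messages can make them easier to skim. The following is only a sketch: summarize_pg_log is a made-up name, and it assumes each log line contains a PostgreSQL severity tag (LOG, WARNING, ERROR, FATAL or PANIC) followed by a colon, with whatever prefix the server is configured to write in front of it.

```shell
# summarize_pg_log: read PostgreSQL log lines on stdin and print
# "<count><TAB><message>" for each distinct message, most frequent first.
# Hypothetical helper; everything before the severity tag is dropped so
# that timestamps and PIDs do not make identical messages look different.
summarize_pg_log() {
    awk '
        match($0, /(LOG|WARNING|ERROR|FATAL|PANIC):/) {
            seen[substr($0, RSTART)]++
        }
        END { for (msg in seen) printf "%d\t%s\n", seen[msg], msg }
    ' | sort -rn
}

# Possible usage:
#   cat /var/lib/pgsql/data/pg_log/*.log | summarize_pg_log
```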
Symfoni
Posts: 17
Joined: Wed Jan 05, 2011 5:46 pm

Re: System is slow, CPU usage skyhigh

Post by Symfoni »

All that's in the logs are two separate error messages, repeated to make a file size of about 20 KB per day.
The error messages are "LOG: could not receive data from client: Connection reset by peer" and "LOG: unexpected EOF on client connection", with the latter being more frequent. I've attached today's log for reference.
postgresql-Mon.log
I also checked netstat for info on any existing connections to postgresql, and the only connections are from localhost (IP 127.0.0.1), but there are 69 connections with the postgresql port at either the 'Local Address' or 'Foreign Address'. Is it normal for there to be several dozen connections to/from the postgresql service?
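
A count like this can be broken down by TCP state with a short filter over the netstat output. This is a sketch under assumptions: pg_socket_states is an invented name, and it assumes PostgreSQL is listening on its default port 5432. Keep in mind that a loopback connection is listed once for each endpoint, so the number of actual connections is roughly half the number of matching lines.

```shell
# pg_socket_states [PORT]: read `netstat -tn` output on stdin and print
# "<count> <state>" for TCP sockets whose local or foreign port is PORT.
# Hypothetical helper; PORT defaults to 5432, the PostgreSQL default.
pg_socket_states() {
    awk -v port="${1:-5432}" '
        BEGIN { suffix = ":" port "$" }
        $1 ~ /^tcp/ && ($4 ~ suffix || $5 ~ suffix) { states[$6]++ }
        END { for (s in states) printf "%d %s\n", states[s], s }
    ' | sort -rn
}

# Possible usage:
#   netstat -tn | pg_socket_states
```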
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: System is slow, CPU usage skyhigh

Post by mguthrie »

Can you run:
ps aux | grep dbmaint.php

And see if you have more than 1 instance of that script running?

Otherwise, can you try running:

/usr/local/nagiosxi/cron/dbmaint.php directly from the command-line and post the output?
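
A small variation on the first command avoids counting the grep process itself. This is a sketch only, and the function name count_dbmaint is made up for illustration.

```shell
# count_dbmaint: read `ps aux` output on stdin and print how many lines
# mention dbmaint.php.
# The [d] bracket expression still matches "dbmaint.php", but the grep
# command line itself contains "[d]bmaint", so grep does not match itself.
# Hypothetical helper name.
count_dbmaint() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c '[d]bmaint\.php' || true
}

# Possible usage:
#   ps aux | count_dbmaint
```

A result greater than 1 would suggest overlapping runs of the maintenance script.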
Symfoni
Posts: 17
Joined: Wed Jan 05, 2011 5:46 pm

Re: System is slow, CPU usage skyhigh

Post by Symfoni »

'ps aux|grep dbmaint' gave no output aside from the grep itself, so no dbmaint.php processes were running.

Running the script, however, gave a bit more output:

Code:

# /usr/local/nagiosxi/cron/dbmaint.php
CLEANING ndoutils TABLE 'externalcommands'...
SQL: DELETE FROM nagios_externalcommands WHERE entry_time < FROM_UNIXTIME(1301387076)
CLEANING ndoutils TABLE 'logentries'...
SQL: DELETE FROM nagios_logentries WHERE logentry_time < FROM_UNIXTIME(1270455876)
CLEANING ndoutils TABLE 'statehistory'...
SQL: DELETE FROM nagios_statehistory WHERE state_time < FROM_UNIXTIME(1238919876)
CLEANING ndoutils TABLE 'timedevents'...
SQL: DELETE FROM nagios_timedevents WHERE event_time < FROM_UNIXTIME(1301991576)
CLEANING ndoutils TABLE 'systemcommands'...
SQL: DELETE FROM nagios_systemcommands WHERE start_time < FROM_UNIXTIME(1301991576)
CLEANING ndoutils TABLE 'servicechecks'...
SQL: DELETE FROM nagios_servicechecks WHERE start_time < FROM_UNIXTIME(1301991576)
CLEANING ndoutils TABLE 'hostchecks'...
SQL: DELETE FROM nagios_hostchecks WHERE start_time < FROM_UNIXTIME(1301991576)
CLEANING ndoutils TABLE 'eventhandlers'...
SQL: DELETE FROM nagios_eventhandlers WHERE start_time < FROM_UNIXTIME(1301991576)
CLEANING nagiosxi TABLE 'commands'...
SQL: DELETE FROM xi_commands WHERE processing_time < 1301963076::abstime::timestamp without time zone
CLEANING nagiosxi TABLE 'events'...
SQL: DELETE FROM xi_events WHERE processing_time < 1301963076::abstime::timestamp without time zone
SQL1: SELECT xi_meta.meta_id FROM xi_meta LEFT JOIN xi_events ON xi_meta.metaobj_id=xi_events.event_id WHERE metatype_id='1' AND event_id IS NULL
SQL2: DELETE FROM xi_meta WHERE meta_id IN (SELECT xi_meta.meta_id FROM xi_meta LEFT JOIN xi_events ON xi_meta.metaobj_id=xi_events.event_id WHERE metatype_id='1' AND event_id IS NULL)
CLEANING nagiosql TABLE 'logbook'...
SQL: DELETE FROM tbl_logbook WHERE time < FROM_UNIXTIME(1301963076)
REPAIRING NAGIOSQL TABLE: tbl_contact
SQL: REPAIR TABLE tbl_contact
REPAIRING NAGIOSQL TABLE: tbl_host
SQL: REPAIR TABLE tbl_host
REPAIRING NAGIOSQL TABLE: tbl_lnkHostToHost
SQL: REPAIR TABLE tbl_lnkHostToHost
REPAIRING NAGIOSQL TABLE: tbl_lnkHostdependencyToHost_DH
SQL: REPAIR TABLE tbl_lnkHostdependencyToHost_DH
REPAIRING NAGIOSQL TABLE: tbl_lnkHostdependencyToHost_H
SQL: REPAIR TABLE tbl_lnkHostdependencyToHost_H
REPAIRING NAGIOSQL TABLE: tbl_lnkServiceToHost
SQL: REPAIR TABLE tbl_lnkServiceToHost
REPAIRING NAGIOSQL TABLE: tbl_lnkServicedependencyToService_DS
SQL: REPAIR TABLE tbl_lnkServicedependencyToService_DS
REPAIRING NAGIOSQL TABLE: tbl_lnkServicedependencyToService_S
SQL: REPAIR TABLE tbl_lnkServicedependencyToService_S
REPAIRING NAGIOSQL TABLE: tbl_lnkServiceToHostgroup
SQL: REPAIR TABLE tbl_lnkServiceToHostgroup
REPAIRING NAGIOSQL TABLE: tbl_logbook
SQL: REPAIR TABLE tbl_logbook
REPAIRING NAGIOSQL TABLE: tbl_service
SQL: REPAIR TABLE tbl_service
REPAIRING NAGIOSQL TABLE: tbl_timeperiod
SQL: REPAIR TABLE tbl_timeperiod
REPAIRING NAGIOSQL TABLE: tbl_timedefinition
SQL: REPAIR TABLE tbl_timedefinition
REPAIRING NAGIOSQL TABLE: tbl_user
SQL: REPAIR TABLE tbl_user
Running the script didn't have any immediate impact on postmaster's cpu usage, however:
topoutput.png
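
The FROM_UNIXTIME() arguments in the output above are Unix timestamps, so converting them to dates shows how far back each table is being trimmed. A minimal sketch, assuming GNU date is available; epoch_to_utc is a hypothetical helper name.

```shell
# epoch_to_utc SECONDS: print a Unix timestamp as a UTC date and time.
# Hypothetical helper; relies on the -d "@N" syntax of GNU date.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Cutoff values copied from the dbmaint.php output above.
for cutoff in 1301991576 1301963076 1301387076 1270455876 1238919876; do
    printf '%s  %s\n' "$cutoff" "$(epoch_to_utc "$cutoff")"
done
```

Comparing the printed dates with the time the script was run gives the retention window for each table.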
rdedon
Posts: 578
Joined: Sat Nov 20, 2010 4:51 pm

Re: System is slow, CPU usage skyhigh

Post by rdedon »

There are a few things here that may help and that look like they relate specifically to the issues occurring:

http://wiki.postgresql.org/wiki/Priorities

Please give these a try and report results when you can.
Rene deDon
Technical Team
___
Nagios Enterprises, LLC
Web: http://www.nagios.com