Nagios performance woes continue

hhlodge · Post by **hhlodge** » Mon Jan 07, 2013 9:13 am

Sorry to beat a dead horse, but our XI instance continues to head downhill. The web GUI is sluggish, and system load still climbs high with many blocked processes. Now barely any services show up, and postgresql has scores of processes running, so many that last night we went over the total processes threshold. I've followed all the steps in the orphaned service FAQ and rebooted several times, to no avail. This is still 2011R3.3 on 64-bit CentOS 5.7. I thought I had this licked when I stopped syslog from writing to local files, as that seemed to be the culprit, but things soon became bad again. Here's a sample of the postgresql processes running this morning.

Code:

[root@psm-itmon ~]# service postgresql status
postmaster (pid 32238 31930 31422 31127 30616 29757 29509 29254 29101 28969 28637 28462 28003 26681 26220 26181 25735 25592 25205 25150 24771 24685 24263 24222 24218 23778 23637 23216 22172 21531 21101 20768 20736 20271 19374 19031 18803 17711 17596 17368 16969 16945 16328 16067 14643 14202 14028 13624 13560 13193 13183 12647 12601 12280 12085 12056 11778 11542 11532 11339 10005 9591 9425 9154 8810 8662 7804 7723 7381 7302 7101 6912 6907 6726 6470 6059 5653 5629 5479 5476 5473 5428 4518 4515 4512 4509 4506 4494 4201 4200 4199 4178 4176 4047 3626 3553 3126 2586 2145 2020 1658 1512 1099) is running...
[root@psm-itmon ~]# vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 4856212 201328 743328    0    0     2    15    5   24 14  0 85  0  0
 1  1      0 4855668 201332 743344    0    0     0  1706 1749  251 15  0 76  8  0

[root@psm-itmon ~]# service nagios status
nagios is not running
[root@psm-itmon ~]# service nagios start
Starting nagios: done.
[root@psm-itmon ~]# ps -deaf | grep nagios
nagios    1024  4255  0 Jan06 ?        00:00:00 crond
nagios    1038  1024  0 Jan06 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios    1042  1038  0 Jan06 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
postgres  1099  4176  0 Jan06 ?        00:00:00 postgres: nagiosxi nagiosxi 127.0.0.1(40332) idle           
nagios    1445  4255  0 Jan06 ?        00:00:00 crond
nagios    1460  1445  0 Jan06 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios    1467  1460  0 Jan06 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
postgres  1512  4176  0 Jan06 ?        00:00:00 postgres: nagiosxi nagiosxi 127.0.0.1(58160) idle

There were scores of the above sets of processes.
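For anyone following along, a rough way to put a number on the idle backends seen in the `ps` listing above is sketched here. The function name `count_idle_backends` is made up for illustration, and the pattern assumes the process title format shown in this thread (`postgres: <user> <db> <client> idle`); other PostgreSQL versions may label backends differently. Querying `pg_stat_activity` from `psql` should give a comparable picture from inside the database.

```shell
#!/bin/sh
# Hypothetical helper: reads `ps -ef`-style lines on stdin and prints how many
# of them are PostgreSQL backends sitting idle.
# Assumption: the process title ends in "idle", as in the listing in this thread.
count_idle_backends() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c 'postgres: .* idle *$' || true
}

# Example (run on the monitoring host):
#   ps -ef | count_idle_backends
```

A count that keeps growing between runs would suggest connections are being opened and never closed, which is worth knowing before chasing other causes.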

mguthrie · Post by **mguthrie** » Mon Jan 07, 2013 10:28 am

Does this wiki seem to apply to your situation?
http://support.nagios.com/wiki/index.ph ... .22_in_log

hhlodge · Post by **hhlodge** » Mon Jan 07, 2013 11:43 am

I ran those vacuum runs this morning, along with almost everything else on that page, then rebooted, and I got my services back. Here's what the log says, so I don't know if it applies.

Code:

[root@psm-itmon pg_log]# grep wrap postgresql-Mon.log
LOG:  transaction ID wrap limit is 1741597677, limited by database "postgres"

mguthrie · Post by **mguthrie** » Mon Jan 07, 2013 3:38 pm

Do the large number of postmaster processes show up again after the reboot?

hhlodge · Post by **hhlodge** » Tue Jan 08, 2013 11:29 am

No, it's been fine since.

scottwilkerson · Post by **scottwilkerson** » Tue Jan 08, 2013 11:41 am

We'll leave this item unlocked for the time being. If it comes back, please add to the thread.

hhlodge · Post by **hhlodge** » Tue Feb 12, 2013 2:36 pm

I just got an alert for load average over 22! vmstat shows up to 8 processes being blocked and iostat is clearly showing high iowait percentages on my RAID 1 device that is /usr/local. Those are 10K SAS drives. I am wondering if going to mirrored SSDs will get me out of this mess. Thoughts?
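To track the blocked-process and I/O wait figures mentioned above over a sampling run, something like the following sketch could summarize `vmstat` output. The name `vmstat_peaks` is hypothetical, and the column positions (`b` as field 2, `wa` as field 16) are assumed from the `vmstat` layout shown earlier in this thread; other procps versions may order the columns differently.

```shell
#!/bin/sh
# Hypothetical helper: reads vmstat output on stdin and prints the highest
# blocked-process count ("b") and I/O wait percentage ("wa") it saw.
# Assumption: 17-column layout as printed earlier in this thread.
vmstat_peaks() {
    awk '
        BEGIN { blocked = 0; iowait = 0 }
        # Data rows start with a number; header rows start with "procs" or "r".
        $1 ~ /^[0-9]+$/ && NF >= 16 {
            if ($2 + 0 > blocked) blocked = $2 + 0
            if ($16 + 0 > iowait) iowait = $16 + 0
        }
        END { printf "max_blocked=%d max_iowait=%d\n", blocked, iowait }
    '
}

# Example: sample every 5 seconds, 10 times, then summarize.
#   vmstat 5 10 | vmstat_peaks
```

Note that the first data row from `vmstat` reports averages since boot, so it tends to understate a current spike.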

hhlodge · Post by **hhlodge** » Tue Feb 12, 2013 3:46 pm

I just installed the HP RAID utility for Linux. Turns out my write cache is disabled because the battery is dead. It's also a cache known to be problematic for this particular RAID controller (P400i). I am going to swap that out and see if that's been my problem all along.
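For readers wanting to run the same check, here is a minimal sketch. It assumes the utility in question is `hpacucli` and that `hpacucli ctrl all show status` prints `Cache Status:` and `Battery/Capacitor Status:` lines; neither is confirmed in this thread, and the helper name `cache_problems` is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads controller status text on stdin and prints any
# cache or battery status line whose value is not "OK".
# Assumption: field labels match those printed by hpacucli.
cache_problems() {
    # "|| true" keeps the exit status clean when every line reports OK.
    grep -E '(Cache|Battery/Capacitor) Status:' | grep -v ': OK *$' || true
}

# Example (as root, assuming hpacucli is installed):
#   hpacucli ctrl all show status | cache_problems
```

Empty output would mean nothing was flagged. Any line printed is worth a closer look, since a disabled write cache forces every write to wait on the disks.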

slansing · Post by **slansing** » Tue Feb 12, 2013 4:02 pm

Alrighty, unfortunate but let us know how it works out.

hhlodge · Post by **hhlodge** » Mon Feb 25, 2013 8:52 am

It has been a week now since I replaced the cache, and there has not been one blocked process, no high load, and very low I/O wait, so I *think* I have this resolved. My apologies for all the bandwidth spent on the wrong cause.

Nagios Support Forum
