Sorry to beat a dead horse, but our XI instance continues to head downhill. The web GUI is sluggish, and system load still goes high with many blocked processes. Now barely any services show up, and postgresql has scores of processes running, so many that last night we went over the total processes threshold. I've followed all the steps in the orphaned service FAQ and rebooted several times, to no avail. This is still 2011R3.3 on CentOS 5.7 64-bit. I thought I had this licked when I stopped syslog from writing to local files, as it seemed to be the culprit, but things soon became bad again. Here's a sample of the postgresql processes running this morning.
I ran those vacuum runs this morning, along with almost everything else on that page, then rebooted, and I got my services back. Here's what the log says; I don't know if it applies.
I just got an alert for load average over 22! vmstat shows up to 8 processes being blocked, and iostat is clearly showing high iowait percentages on the RAID 1 device that holds /usr/local. Those are 10K SAS drives. I am wondering if going to mirrored SSDs will get me out of this mess. Thoughts?
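For anyone who wants to watch the same two numbers without eyeballing vmstat, here is a rough sketch that reads them from /proc/stat. It assumes a Linux kernel that exposes the `procs_blocked` line and the standard `cpu` counters, and a reasonably modern Python 3 (the stock Python on CentOS 5 is older, so this may need adjusting there). The function names and the 5-second interval are my own choices, not anything from XI.

```python
import time


def parse_proc_stat(text):
    """Pull the aggregate CPU tick counters and procs_blocked out of /proc/stat text."""
    result = {"ticks": [], "procs_blocked": None}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":
            # Order: user nice system idle iowait irq softirq steal.
            # Later fields (guest, guest_nice) are already counted in user/nice.
            result["ticks"] = [int(v) for v in fields[1:9]]
        elif fields[0] == "procs_blocked":
            result["procs_blocked"] = int(fields[1])
    return result


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat snapshots."""
    old = parse_proc_stat(before)["ticks"]
    new = parse_proc_stat(after)["ticks"]
    deltas = [n - o for n, o in zip(new, old)]
    total = sum(deltas)
    if len(deltas) < 5 or total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_proc_stat(path="/proc/stat"):
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    first = read_proc_stat()
    time.sleep(5)  # sampling interval is an arbitrary choice
    second = read_proc_stat()
    blocked = parse_proc_stat(second)["procs_blocked"]
    print("blocked processes: %s" % blocked)
    print("iowait over interval: %.1f%%" % iowait_percent(first, second))
```

Run it a few times while the load is high. A consistently non-zero blocked count together with a large iowait share points at storage rather than CPU, which is what vmstat and iostat were suggesting here.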
I just installed the HP RAID utility for Linux. Turns out my write cache is disabled because the battery is dead. It's also a cache known to be problematic for this particular RAID controller (P400i). I am going to swap that out and see if that's been my problem all along.
It has been a week now since I replaced the cache, and there has not been one blocked process, no high load, and very low iowait, so I *think* I have this resolved. My apologies for all the bandwidth spent on the wrong cause.