
PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Fri Nov 08, 2013 8:12 am
by WillemDH
Hello,

We just received a warning for our Nagios server: PROCS WARNING: 257 processes with STATE = RSZDT.

As we have only 262 hosts and 2192 services at the moment, I was wondering whether 257 RSZDT processes are an issue. If not, what would be the recommended warning / critical thresholds?

Thanks.

Willem

Re: PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Fri Nov 08, 2013 10:37 am
by slansing
That is not necessarily a bad number of processes, since Nagios and Apache commonly fork, and depending on what you are checking, the plugins may fork as well. I would suggest bumping the warning up to 300 or so and, if you have the chance, watching top to see if it is just a spike that is occurring.
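Raising the threshold is just a matter of adjusting the -w/-c flags on check_procs. A sketch of what the adjusted command definition might look like (the command name, the $USER1$ plugin path, and the critical value of 400 are assumptions based on a default Nagios install, not taken from this thread):

```
define command{
    command_name    check_procs_rszdt
    command_line    $USER1$/check_procs -w 300 -c 400 -s RSZDT
}
```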

Re: PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Tue Nov 12, 2013 9:33 am
by WillemDH
Ok, changing the warning threshold to 300 would be fine for me, but in the meantime I'm at 289 processes. So apparently the number of RSZDT processes keeps rising, which means the threshold for the service would be exceeded again in a few days... Any idea why these RSZDT processes would suddenly start going up?

R = running,
S = interruptible sleep (waiting for an event to complete),
Z = defunct ("zombie") process,
D = uninterruptible sleep,
T = stopped
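Those letters can be tallied directly to see how the RSZDT total breaks down per state; a minimal sketch, assuming a standard procps ps:

```shell
# Count processes per state letter; the first character of the
# state field is R, S, Z, D or T as listed above.
ps -eo state= | cut -c1 | sort | uniq -c | sort -rn
```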


I did a ps aux | grep " Z. " as suggested in another thread, with this result:

root 8769 0.0 0.0 103244 852 pts/1 S+ 15:38 0:00 grep Z.
root 24692 0.0 0.0 0 0 ? Z Oct31 0:00 [abrt-server] <defunct>
root 24693 0.0 0.0 0 0 ? Z Oct31 0:00 [abrt-server] <defunct>

I did it a few times and mostly got the result above, but a few times I got:

nagios 9266 0.0 0.0 0 0 ? Z 15:39 0:00 [nagios] <defunct>
nagios 9268 0.0 0.0 0 0 ? Z 15:39 0:00 [nagios] <defunct>
nagios 9270 0.0 0.0 0 0 ? Z 15:39 0:00 [nagios] <defunct>
nagios 9272 0.0 0.0 0 0 ? Z 15:39 0:00 [nagios] <defunct>
nagios 9273 0.0 0.0 0 0 ? Z 15:39 0:00 [nagios] <defunct>
nagios 9275 0.0 0.0 0 0 ? Z 15:39 0:00 [nagios] <defunct>
nagios 9281 0.0 0.0 0 0 ? Z 15:39 0:00 [nagios] <defunct>
nagios 9282 0.0 0.0 0 0 ? Z 15:39 0:00 [nagios] <defunct>
root 9291 0.0 0.0 103244 852 pts/1 S+ 15:39 0:00 grep Z.
root 24692 0.0 0.0 0 0 ? Z Oct31 0:00 [abrt-server] <defunct>
root 24693 0.0 0.0 0 0 ? Z Oct31 0:00 [abrt-server] <defunct>

When I do the same with ps aux | grep " S. ", it becomes clear that the majority of the processes are in that state. Last week I added some databases to our monitoring using check_mssql_server.py, and these checks seem to make up the majority of the running processes. This is kind of worrying, as I only added 2 databases on 2 MSSQL instances, and we have a lot of databases that still need to be added.
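Since zombies linger only because their parent has not reaped them, listing each zombie together with its parent PID points at the culprit; a sketch along the same lines as the grep above:

```shell
# Print each zombie with its parent's PID and command name,
# so the non-reaping parent process can be identified.
ps -eo pid=,ppid=,stat=,comm= | awk '$3 ~ /^Z/ {print "zombie", $1, "parent", $2, $4}'
```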

Re: PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Tue Nov 12, 2013 11:55 am
by sreinhardt
Are these mssql checks taking a long time, or often going into a warning or critical state? How many checks per mssql instance did you add?

Re: PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Wed Nov 13, 2013 4:04 am
by WillemDH
In the meantime we are at 298 processes with STATE = RSZDT, so apparently it keeps rising...
I added two MSSQL instances with 18 checks per instance, plus 2 databases each with 17 services / checks per database. So in total about 104 checks, which run every 5 minutes. Is this really too much?
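For perspective, the numbers above work out to a fairly modest check rate; a quick back-of-the-envelope calculation, assuming the counts quoted above:

```shell
# 2 instances x 18 instance checks + 2 instances x 2 DBs x 17 DB checks
# = 36 + 68 = 104 checks, spread over a 300-second (5-minute) interval.
awk 'BEGIN { total = 2*18 + 2*2*17; printf "%d checks, %.2f checks/sec\n", total, total/300 }'
# → 104 checks, 0.35 checks/sec
```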

I could change the check interval..., but can you tell me the maximum check interval I can use with the RRD graphs still working fine? (See the issue described in http://support.nagios.com/forum/viewtop ... 8&start=30)

Apart from the processes, I now also have the swap usage in 'red' state... See screenshot... It seems that last night around 00:00 it went from 158 MB of free swap to 14 MB of free swap space. I'm not sure whether this is related to the processes issue, but seeing that there are a lot of Python processes and the MSSQL check is a Python script, it probably is related?

Running top sorted by swap (pressing O, then p) gives:

top - 10:03:49 up 75 days, 11:52, 1 user, load average: 0.46, 0.67, 0.43
Tasks: 406 total, 1 running, 403 sleeping, 0 stopped, 2 zombie
Cpu(s): 8.0%us, 2.2%sy, 0.0%ni, 89.4%id, 0.2%wa, 0.1%hi, 0.2%si, 0.0%st
Mem: 3921112k total, 2880696k used, 1040416k free, 118728k buffers
Swap: 262136k total, 246684k used, 15452k free, 1279432k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP COMMAND
2035 mysql 20 0 2203m 44m 3512 S 7.9 1.2 1125:42 25m mysqld
18058 nagios 20 0 215m 4356 4352 S 0.0 0.1 0:00.12 13m php
1866 root 20 0 160m 1512 292 S 0.0 0.0 1:30.69 9772 snmptt
1867 root 20 0 164m 3128 1016 S 0.0 0.1 4:33.53 8928 snmptt
2255 ajaxterm 20 0 163m 1884 1220 S 0.0 0.0 28:22.19 5580 python
14634 nagios 20 0 196m 3692 3688 S 0.0 0.1 0:00.03 4844 python
6973 nagios 20 0 196m 3692 3688 S 0.0 0.1 0:00.04 4840 python
6978 nagios 20 0 196m 3692 3688 S 0.0 0.1 0:00.04 4840 python
10008 nagios 20 0 196m 3692 3688 S 0.0 0.1 0:00.04 4840 python
12954 nagios 20 0 196m 3692 3688 S 0.0 0.1 0:00.04 4840 python
5188 nagios 20 0 196m 3768 3688 S 0.0 0.1 0:00.04 4764 python
6363 nagios 20 0 196m 3820 3688 S 0.0 0.1 0:00.05 4716 python
25801 nagios 20 0 196m 4152 3688 S 0.0 0.1 0:00.05 4680 python
15826 nagios 20 0 196m 3904 3688 S 0.0 0.1 0:00.05 4624 python
25955 nagios 20 0 196m 4148 3688 S 0.0 0.1 0:00.04 4380 python
32441 nagios 20 0 196m 4460 3688 S 0.0 0.1 0:00.05 4072 python
27742 nagios 20 0 196m 4728 3688 S 0.0 0.1 0:00.05 3800 python
29115 nagios 20 0 196m 5084 3688 S 0.0 0.1 0:00.05 3444 python
28829 nagios 20 0 196m 5136 3688 S 0.0 0.1 0:00.04 3396 python
13429 nagios 20 0 196m 5344 3688 S 0.0 0.1 0:00.04 3188 python
12970 nagios 20 0 196m 5380 3688 S 0.0 0.1 0:00.04 3152 python
12965 nagios 20 0 196m 5740 3688 S 0.0 0.1 0:00.04 3120 python
19837 nagios 20 0 196m 6092 3688 S 0.0 0.2 0:00.04 2844 python
15715 nagios 20 0 196m 5740 3688 S 0.0 0.1 0:00.04 2816 python
25961 nagios 20 0 196m 6016 3688 S 0.0 0.2 0:00.05 2816 python
8738 nagios 20 0 196m 5740 3688 S 0.0 0.1 0:00.04 2800 python
6077 nagios 20 0 196m 5740 3688 S 0.0 0.1 0:00.04 2792 python
4032 nagios 20 0 196m 5740 3688 S 0.0 0.1 0:00.04 2788 python
4227 nagios 20 0 196m 5740 3688 S 0.0 0.1 0:00.04 2788 python
6325 nagios 20 0 196m 5776 3688 S 0.0 0.1 0:00.05 2760 python
12111 root 20 0 327m 13m 6820 S 0.0 0.4 1:14.54 2724 httpd
29721 nagios 20 0 196m 6220 3688 S 0.0 0.2 0:00.06 2724 python
12758 apache 20 0 451m 38m 8324 S 0.0 1.0 5:06.67 2708 httpd
1442 apache 20 0 442m 31m 4532 S 0.0 0.8 1:12.96 2660 httpd
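One caveat about the SWAP column in that top version: it is computed as VIRT - RES, which overstates actual swap use. On kernels that expose the VmSwap field in /proc/&lt;pid&gt;/status (mainline 2.6.34 and later; older distribution kernels may or may not have it backported), per-process swap can be read directly; a sketch:

```shell
# List per-process swap usage (kB) from /proc, largest first.
# Produces no output on kernels without the VmSwap field.
for f in /proc/[0-9]*/status; do
  awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$f" 2>/dev/null
done | sort -rn | head
```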

Re: PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Wed Nov 13, 2013 10:22 am
by slansing
I'm quite surprised you only have about 256 MB of swap in total... is this a VM? That is a very low number. Also, the maximum amount of time, "theoretically", that you could push check intervals out to before your RRDs stop plotting data is around 1 hour. I would not push it that far, though.

Re: PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Wed Nov 13, 2013 10:58 am
by WillemDH
This is your official CentOS image. The server has 4 GB RAM. Isn't the swap space configured in your image? Should I try to make it larger? I found http://www.centos.org/docs/5/html/Deplo ... dding.html
Should I create a new swap partition, create a new swap file, or extend swap on an existing LVM2 logical volume?
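Of the three options, a swap file is usually the least disruptive on a running VM, since it needs no repartitioning. A sketch of the usual sequence, to be run as root; the 1 GB size and the /swapfile path are examples only:

```shell
# Create a 1 GB file, restrict its permissions, then format
# and enable it as swap space.
dd if=/dev/zero of=/swapfile bs=1M count=1024
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Persist across reboots:
echo '/swapfile swap swap defaults 0 0' >> /etc/fstab
# Verify the new swap is active:
swapon -s
```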

Re: PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Wed Nov 13, 2013 1:54 pm
by abrist
What is the average check latency? (screenshot the performance page)

Re: PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Thu Nov 14, 2013 3:04 am
by WillemDH
See screenshot. I suppose the latency is OK. The max service execution time is pretty high, but how far back does this monitoring performance engine dashlet go?
It seems the free swap space is still going down... So should I follow the procedure from my previous post and make the swap larger?

Re: PROCS WARNING: 257 processes with STATE = RSZDT

Posted: Thu Nov 14, 2013 5:04 am
by WillemDH
In the meantime we had 0 MB of free swap space, so I decided to reboot the server, and it now has all its swap space available again. I'll monitor it closely in the coming weeks. Please let me know if I should expand it.