Tuning STATE = RSZDT check

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Post by jbennett »

I am looking to fine-tune the check on our local machine for STATE = RSZDT. I'm wondering whether my settings might be a little too conservative for my setup and would appreciate some feedback.

# Active Host / Service Checks: 1696 / 4303
# Passive Host / Service Checks: 0 / 0

I am seeing a critical state for PROCS CRITICAL: 484 processes with STATE = RSZDT

My check:

Code:

$USER1$/check_procs -w 250 -c 400 -s RSZDT
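As far as I understand `check_procs`, `-s RSZDT` counts processes whose `ps` state matches any of the listed flags, which on most systems covers nearly every process, so the 484 may simply be close to the total process count. A per-state breakdown can show what actually makes up that number. This is a minimal sketch, assuming a procps-style `ps` that accepts `-eo stat=`; the function name `count_by_state` is made up for illustration.

```shell
# Count processes by the first letter of their ps state (R, S, Z, D, T, ...).
# Reads one state string per line on stdin, as printed by `ps -eo stat=`.
count_by_state() {
    awk '{ counts[substr($1, 1, 1)]++ }
         END { for (s in counts) print s, counts[s] }' | sort
}

# Usage (assumed invocation):
#   ps -eo stat= | count_by_state
```

If most of the total turns out to be sleeping (S) processes, raising the thresholds is probably more appropriate than chasing zombies.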
After reading through some other threads, I checked for zombie processes: ps aux | grep " Z. "
I ran this command several times back to back and got different output each time. I assume this is normal, since the defunct processes are being cleaned up between runs?

Code:

[user@server ~]$ ps aux | grep " Z. "
nagios    3306  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3308  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3310  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3312  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
user      3316  0.0  0.0  61180   764 pts/0    S+   09:02   0:00 grep  Z.
[user@server ~]$ ps aux | grep " Z. "
user      3670  0.0  0.0  61180   740 pts/0    R+   09:02   0:00 grep  Z.
[user@server ~]$ ps aux | grep " Z. "
nagios    3755  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3768  0.0  0.0      0     0 ?        Z    09:02   0:00 [check_dummy] <defunct>
user      3775  0.0  0.0  61180   740 pts/0    R+   09:02   0:00 grep  Z.
[user@server ~]$ ps aux | grep " Z. "
user      3852  0.0  0.0  61180   764 pts/0    S+   09:02   0:00 grep  Z.
[user@server ~]$ ps aux | grep " Z. "
nagios    3863  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3867  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3876  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
user      3880  0.0  0.0  61180   740 pts/0    R+   09:02   0:00 grep  Z.
[user@server ~]$ ps aux | grep " Z. "
user      3889  0.0  0.0  61180   764 pts/0    S+   09:02   0:00 grep  Z.
[user@server ~]$ ps aux | grep " Z. "
nagios    3896  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
user      3915  0.0  0.0  61184   820 pts/0    S+   09:02   0:00 grep  Z.
[user@server ~]$ ps aux | grep " Z. "
nagios    3940  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3942  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3945  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3947  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
nagios    3951  0.0  0.0      0     0 ?        Z    09:02   0:00 [nagios] <defunct>
user      3972  0.0  0.0  61180   736 pts/0    R+   09:02   0:00 grep  Z.
I have checked for processes and output the result to a .txt file which I have attached. Can someone please verify that this is normal and I just need to adjust my check settings?
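The `grep " Z. "` pattern also matches the grep process itself, and it does not show which parent is slow to collect its children. A sketch that filters on the state column and prints the parent PID is given here; it assumes a procps-style `ps` with `-o` format options, and `list_zombies` is a hypothetical helper name.

```shell
# Print "PID PPID COMMAND" for zombie processes only.
# Reads the output of `ps -e -o stat= -o pid= -o ppid= -o comm=` on stdin.
list_zombies() {
    # Field 1 is the state; zombies start with Z.
    awk '$1 ~ /^Z/ { print $2, $3, $4 }'
}

# Usage (assumed invocation):
#   ps -e -o stat= -o pid= -o ppid= -o comm= | list_zombies
```

Zombies that appear with new PIDs on every run and vanish again are most likely short-lived children waiting to be collected by their parent, rather than processes that are stuck.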

When I list the running processes, I see what appears to be an awful lot of the following. Is this expected?

Code:

nagios    2043  2034  0 02:39 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1
nagios    2044  2043  0 02:39 ?        00:00:02 /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php
postgres  2082  3232  0 02:39 ?        00:00:34 postgres: nagiosxi nagiosxi 127.0.0.1(35123) idle 
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Post by slansing »

Can you kill PHP off completely and then do a running tail of the following log for two minutes or so? Let us know if you see any errors:

Code:

killall -9 php

Code:

tail -f /usr/local/nagiosxi/var/cleaner.log
It would also be a good idea to copy a decent-sized chunk of the output and post it back here.
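One way to capture roughly two minutes of the log into a file for posting is sketched here. It assumes GNU coreutils `timeout` is installed, which may not be the case on older systems; the function name and the output path in the usage line are only examples.

```shell
# Follow a log file for a fixed number of seconds and save what was seen.
# Assumes GNU coreutils `timeout` is available.
capture_log() {
    log=$1; seconds=$2; out=$3
    # stderr is captured too, so tail's own notices end up in the sample.
    timeout "$seconds" tail -f "$log" > "$out" 2>&1
    status=$?
    # timeout exits with 124 when it stops the command; that is the expected case here.
    [ "$status" -eq 124 ] || return "$status"
    return 0
}

# Usage (example paths):
#   capture_log /usr/local/nagiosxi/var/cleaner.log 120 /tmp/cleaner_sample.txt
```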
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Post by jbennett »

Line 130 of /usr/local/nagiosxi/html/includes/components/ccm/ccm.inc.php:

Code:

file_put_contents($ccm_cfg,$content);

Code:

[root@server ~]# killall -9 php
[root@server ~]# tail -f /usr/local/nagiosxi/var/cleaner.log
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
tail: /usr/local/nagiosxi/var/cleaner.log: file truncated
PHP Warning:  file_put_contents(/usr/local/nagiosxi/etc/components/ccm_config.inc.php): failed to open stream: Permission denied in /usr/local/nagiosxi/html/includes/components/ccm/ccm.inc.php on line 130
DIR: /usr/local/nagiosxi/nom/checkpoints/nagioscore
NUMFOUND: 10
KEEPING ALL GOOD CHECKPOINTS
DIR: /usr/local/nagiosxi/nom/checkpoints/nagioscore/errors
ls: /usr/local/nagiosxi/nom/checkpoints/nagioscore/errors/*.gz: No such file or directory
NUMFOUND: 0
KEEPING ALL ERROR CHECKPOINTS
DIR: /usr/local/nagiosxi/nom/checkpoints/nagiosxi
NUMFOUND: 10
KEEPING ALL SNAPSHOTS
NAGIOS IM FLUSH INCIDENTS
COUNT:2833
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
tail: /usr/local/nagiosxi/var/cleaner.log: file truncated
PHP Warning:  file_put_contents(/usr/local/nagiosxi/etc/components/ccm_config.inc.php): failed to open stream: Permission denied in /usr/local/nagiosxi/html/includes/components/ccm/ccm.inc.php on line 130
DIR: /usr/local/nagiosxi/nom/checkpoints/nagioscore
ls: /usr/local/nagiosxi/nom/checkpoints/nagioscore/errors/*.gz: No such file or directory
NUMFOUND: 10
KEEPING ALL GOOD CHECKPOINTS
DIR: /usr/local/nagiosxi/nom/checkpoints/nagioscore/errors
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
tail: /usr/local/nagiosxi/var/cleaner.log: file truncated
PHP Warning:  file_put_contents(/usr/local/nagiosxi/etc/components/ccm_config.inc.php): failed to open stream: Permission denied in /usr/local/nagiosxi/html/includes/components/ccm/ccm.inc.php on line 130
DIR: /usr/local/nagiosxi/nom/checkpoints/nagioscore
NUMFOUND: 10
KEEPING ALL GOOD CHECKPOINTS
DIR: /usr/local/nagiosxi/nom/checkpoints/nagioscore/errors
ls: /usr/local/nagiosxi/nom/checkpoints/nagioscore/errors/*.gz: No such file or directory
NUMFOUND: 0
KEEPING ALL ERROR CHECKPOINTS
DIR: /usr/local/nagiosxi/nom/checkpoints/nagiosxi
NUMFOUND: 10
KEEPING ALL SNAPSHOTS
NAGIOS IM FLUSH INCIDENTS
COUNT:2833
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
tail: /usr/local/nagiosxi/var/cleaner.log: file truncated
PHP Warning:  file_put_contents(/usr/local/nagiosxi/etc/components/ccm_config.inc.php): failed to open stream: Permission denied in /usr/local/nagiosxi/html/includes/components/ccm/ccm.inc.php on line 130
DIR: /usr/local/nagiosxi/nom/checkpoints/nagioscore
NUMFOUND: 10
KEEPING ALL GOOD CHECKPOINTS
DIR: /usr/local/nagiosxi/nom/checkpoints/nagioscore/errors
ls: /usr/local/nagiosxi/nom/checkpoints/nagioscore/errors/*.gz: No such file or directory
NUMFOUND: 0
KEEPING ALL ERROR CHECKPOINTS
DIR: /usr/local/nagiosxi/nom/checkpoints/nagiosxi
NUMFOUND: 10
KEEPING ALL SNAPSHOTS
NAGIOS IM FLUSH INCIDENTS
COUNT:2833
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
Nagios IM Send ERROR: 0
Nagios IM Sending Incident...
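Because the same few messages repeat, a frequency summary of a saved sample can be easier to read than the raw tail. The sketch below uses only standard tools; `summarize_log` is a hypothetical helper name. The repeated "file truncated" notices would be consistent with each cron run reopening the log through the `>` redirect, though that is an inference from the output rather than something confirmed in this thread.

```shell
# Summarize a captured log: count identical lines, most frequent first.
# Reads log lines on stdin and prints "<count> <line>".
summarize_log() {
    sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Usage (example path):
#   summarize_log < /tmp/cleaner_sample.txt
```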
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Post by slansing »

Can you run the following commands and post their output here? We may have a permissions issue:

Code:

ll -la /usr/local/nagiosxi/etc/components/ccm_config.inc.php

ll -la /usr/local/nagiosxi/etc/components/
As well as the following:

Code:

echo "show processlist;" | mysql -pnagiosxi| wc -l

Do you have many open files on this system? Or many additional cron jobs besides those normally associated with Nagios XI or your system by default?
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Post by jbennett »

Code:

[root@system ~]# ll -la /usr/local/nagiosxi/etc/components/ccm_config.inc.php
-rw-r--r-- 1 apache nagios 565 Mar 13 11:10 /usr/local/nagiosxi/etc/components/ccm_config.inc.php
[root@system ~]# ll -la /usr/local/nagiosxi/etc/components/
total 392
drwsrwsr-x 3 apache nagios   4096 Oct 16 16:21 .
drwxr-xr-x 3 nagios nagios   4096 May  4  2011 ..
-rwxrwxr-x 1 apache nagios 188086 Jan 18 11:41 bpi.conf
-rwxrwxr-x 1 apache nagios 187172 Jan 18 11:41 bpi.conf.backup
-rw-r--r-- 1 apache nagios    565 Mar 13 11:10 ccm_config.inc.php
drwsrwsr-x 2 apache nagios   4096 May  4  2011 webinject
[root@system ~]# echo "show processlist;" | mysql -pnagiosxi| wc -l
42
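The listing shows ccm_config.inc.php with mode -rw-r--r--, owned by apache with group nagios, while the earlier process listing shows cleaner.php running as nagios. If that holds, nagios would only have group read access to the file, which would be consistent with the "Permission denied" warning. A small sketch for checking the group-write bit of an `ls -l` mode string is given here; `group_writable` is a made-up helper, not part of Nagios XI.

```shell
# Return success if the group-write bit is set in an `ls -l` mode string,
# e.g. "-rw-r--r--" (not group-writable) or "-rwxrwxr-x" (group-writable).
group_writable() {
    mode=$1
    # Character 1 is the file type, 2-4 are owner bits, 5-7 are group bits,
    # so character 6 is the group write bit.
    [ "$(printf '%s' "$mode" | cut -c6)" = "w" ]
}

# Usage (example):
#   f=/usr/local/nagiosxi/etc/components/ccm_config.inc.php
#   group_writable "$(ls -l "$f" | awk '{ print $1 }')" || echo "group cannot write $f"
```

A more direct test, where `sudo` is available, is `sudo -u nagios test -w` followed by the file path.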
As for open files, I'm not sure what a normal amount might be, but after looking at the output, I'm going to guess the answer is yes. I attached the results of an lsof run on the system.

When I grep for currently running crons, this is what I see:

Code:

[root@system ~]# ps ax | grep cron
 3352 ?        Ss     0:01 crond
10363 ?        S      0:00 crond
10371 ?        Ss     0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1
10374 ?        S      0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php
11174 ?        S      0:00 crond
11180 ?        Ss     0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1
11185 ?        S      0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php
13713 ?        S      0:00 crond
13734 ?        Ss     0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1
13735 ?        S      0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php
15681 ?        S      0:00 crond
15686 ?        Ss     0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1
15687 ?        S      0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php
17926 ?        S      0:00 crond
17932 ?        Ss     0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php > /usr/local/nagiosxi/var/cleaner.log 2>&1
17934 ?        S      0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cleaner.php
29186 ?        S      0:00 crond
29188 ?        S      0:00 crond
29189 ?        S      0:00 crond
29193 ?        Ss     0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1
29197 ?        S      0:00 crond
29200 ?        Ss     0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local/nagiosxi/var/eventman.log 2>&1
29201 ?        S      0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
29202 ?        Ss     0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1
29203 ?        S      0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
29204 ?        S      0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
29205 ?        Ss     0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1
29206 ?        S      0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
31047 pts/0    S+     0:00 grep cron
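The listing above shows five cleaner.php instances running at the same time, which suggests that runs are piling up rather than finishing. A sketch for counting running instances per Nagios XI cron script from `ps ax` output is given here; `count_cron_scripts` is a hypothetical helper, and it assumes GNU or BSD `grep` with the `-o` option.

```shell
# Count running instances of each Nagios XI cron script.
# Reads `ps ax` output on stdin and prints "<count> <script>" per script.
count_cron_scripts() {
    # Skip the "/bin/sh -c" wrapper lines so each run is counted once.
    grep -v '/bin/sh -c' \
        | grep -o '/usr/local/nagiosxi/cron/[A-Za-z0-9_]*\.php' \
        | sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Usage (assumed invocation):
#   ps ax | count_cron_scripts
```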
As for additional cron jobs, it is a possibility, as this VM is maintained by our server group, while I only handle the Nagios side of things.

I'm not certain how I can check for ALL cron jobs that might run, but I do find this:

Code:

[root@server etc]# ls cro*
cron.deny  crontab

cron.d:
mrtg  nagiosxi  sysstat

cron.daily:
00webalizer  0anacron  0logwatch  certwatch  cups  logrotate  makewhatis.cron  mlocate.cron  prelink  rhsmd  rpm  tmpwatch

cron.hourly:
mcelog.cron

cron.monthly:
0anacron

cron.weekly:
0anacron  99-raid-check  makewhatis.cron
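The directories above cover the system-wide jobs, but per-user crontabs are stored separately. A sketch that walks the accounts in a passwd-format file and prints each user's crontab is given here. It assumes a `crontab` command that supports `-u` and `-l` and that it is run as root; `list_user_crontabs` is a made-up name.

```shell
# Print each user's crontab, if one exists.
# Takes a passwd-format file as an optional argument (default: /etc/passwd).
# Needs root privileges to read other users' crontabs.
list_user_crontabs() {
    passwd_file=${1:-/etc/passwd}
    cut -d: -f1 "$passwd_file" | while read -r user; do
        # Skip users without a crontab (crontab exits non-zero for them).
        entries=$(crontab -u "$user" -l 2>/dev/null) || continue
        printf '### %s\n%s\n' "$user" "$entries"
    done
}

# Usage (assumed invocation, as root):
#   list_user_crontabs
```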
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Post by scottwilkerson »

Actually, there was a bug with some items being run by cleaner.php that was fixed in 2012R1.5.

I would strongly recommend upgrading to the latest release, 2012R1.6.
Former Nagios employee
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Post by jbennett »

Easy enough! Thanks!
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Post by slansing »

Locking as resolved.