Status critical

Posted: Thu Aug 17, 2017 1:31 pm
by Rovendra
Hi all,

I'm new to Nagios, so I'm sorry if this is a newbie question :D. I've got Nagios running to monitor some services, and one of them is a haproxy service. I had internet issues yesterday and my servers were down for a couple of hours. The problem is that now that everything is back online and working, the service check won't recover from the critical state (haproxy is up and running).
Any suggestions on what is causing this issue?

Thanks!

Re: Status critical

Posted: Thu Aug 17, 2017 2:44 pm
by bolson
Hello,

Can you run the haproxy check command from the command line on the server and post the result?

Thank you!

Re: Status critical

Posted: Thu Aug 17, 2017 3:13 pm
by Rovendra
This is the check command run directly on the nagios server:

Re: Status critical

Posted: Fri Aug 18, 2017 1:21 pm
by bolson
When you force a check from the web gui what do you get?

Re: Status critical

Posted: Thu Sep 14, 2017 11:19 am
by Rovendra
Ok guys

Just to give an update on what has happened since I had this problem. I was finally able to figure out that the status is critical because I have more than one haproxy process running on the server. That's why the value returned is 2 and not 1, but that should still mean it's ok. Now I have two more questions. The first is why the hell haproxy is spawning more than one process, since it didn't do that before (the same is happening with my apache: it's spawning 8 processes and it didn't do that before). The second is why the service is shown as critical in the interface, since the snmp check I do is:

check_snmp -o 1.3.6.1.4.1.2021.2.1.5.6 -C STGen2016 -r [1-9][0-9]* <server>

and the regex part is supposed to let nagios know that any number of processes is normal.
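
For anyone following along, the pattern itself can be tried locally without SNMP. This is only a rough sketch: it assumes `grep -E` handles the pattern the same way the plugin's `-r` option does (an unanchored extended regex search), and `matches_count` is a made-up helper name. The pattern is quoted here because an unquoted `[1-9][0-9]*` can be expanded by the shell if a file name in the current directory happens to match it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the value contains a match for the
# pattern used in the check (unanchored extended regex search).
matches_count() {
    printf '%s\n' "$1" | grep -Eq '[1-9][0-9]*'
}

# Try a few process counts, including 0 (no process running).
for value in 1 2 8 0; do
    if matches_count "$value"; then
        echo "$value: match"
    else
        echo "$value: no match"
    fi
done
```

If the pattern behaves this way in the plugin as well, any non-zero count should match, which would point away from the regex as the cause of the critical status.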

Any suggestions? And thanks in advance.

Re: Status critical

Posted: Thu Sep 14, 2017 1:18 pm
by Rovendra
Ok guys, I've found out more information. Looking at old screenshots of nagios, I've found that 2 haproxy processes and 7 apache processes are supposed to be normal. So I'm guessing something changed in nagios (I don't know how that is possible, since nobody touched this server) and it is now treating multiple processes as a critical status. The strange fact remains that a status check on the command line returns ok while the interface lists the status as critical.

Note: we use NagiosQL.

I appreciate any help.

Re: Status critical

Posted: Fri Sep 15, 2017 2:13 pm
by tgriep
Try running these commands to stop and start the Nagios Daemon.

Code:

service nagios stop
killall -9 nagios
service nagios start

Log out of the GUI, log back in, and see if the status is updated at the next check.

If not, run the following command and post the output.

Code:

ps -ef --cols=300

Also, open the status.dat file, find that service entry, and post it here as well.
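
If the file is large, something like the following could pull out just the matching entry instead of scrolling through it in an editor. It is a sketch, not an official tool: `service_entry` is a made-up helper, it assumes the usual status.dat layout of `servicestatus { ... }` blocks containing a `service_description=` line, and the path in the usage comment is only a guess based on a default `/usr/local/nagios` install.

```shell
#!/bin/sh
# Hypothetical helper: print every servicestatus block whose
# service_description equals $2.
# $1 = path to status.dat, $2 = service description.
service_entry() {
    awk -v svc="$2" '
        # A new service block starts here; reset the buffer.
        index($0, "servicestatus {") == 1 { inblock = 1; block = ""; hit = 0 }
        inblock {
            block = block $0 "\n"
            line = $0
            sub(/^[ \t]+/, "", line)          # entries are indented
            if (line == "service_description=" svc) hit = 1
            if (line == "}") {                # end of block
                if (hit) printf "%s", block
                inblock = 0
            }
        }
    ' "$1"
}

# Example usage (path and service name are assumptions):
#   service_entry /usr/local/nagios/var/status.dat "<service description>"
```

The `current_state`, `plugin_output` and `last_check` lines in the printed block are usually the interesting ones to post.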

Re: Status critical

Posted: Fri Sep 15, 2017 3:17 pm
by Rovendra
This is the output of the 'ps -ef --cols=300' command:

Code:

[root@help /]# ps -ef --cols=300
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Apr05 ?        00:00:52 /sbin/init
root         2     0  0 Apr05 ?        00:00:00 [kthreadd]
root         3     2  0 Apr05 ?        00:01:31 [migration/0]
root         4     2  0 Apr05 ?        00:01:17 [ksoftirqd/0]
root         5     2  0 Apr05 ?        00:00:00 [stopper/0]
root         6     2  0 Apr05 ?        00:00:11 [watchdog/0]
root         7     2  0 Apr05 ?        00:00:49 [migration/1]
root         8     2  0 Apr05 ?        00:00:00 [stopper/1]
root         9     2  0 Apr05 ?        00:00:25 [ksoftirqd/1]
root        10     2  0 Apr05 ?        00:00:09 [watchdog/1]
root        11     2  0 Apr05 ?        00:07:01 [events/0]
root        12     2  0 Apr05 ?        01:03:40 [events/1]
root        13     2  0 Apr05 ?        00:00:00 [cgroup]
root        14     2  0 Apr05 ?        00:00:00 [khelper]
root        15     2  0 Apr05 ?        00:00:00 [netns]
root        16     2  0 Apr05 ?        00:00:00 [async/mgr]
root        17     2  0 Apr05 ?        00:00:00 [pm]
root        18     2  0 Apr05 ?        00:00:27 [sync_supers]
root        19     2  0 Apr05 ?        00:00:41 [bdi-default]
root        20     2  0 Apr05 ?        00:00:00 [kintegrityd/0]
root        21     2  0 Apr05 ?        00:00:00 [kintegrityd/1]
root        22     2  0 Apr05 ?        00:09:28 [kblockd/0]
root        23     2  0 Apr05 ?        00:09:50 [kblockd/1]
root        24     2  0 Apr05 ?        00:00:00 [kacpid]
root        25     2  0 Apr05 ?        00:00:00 [kacpi_notify]
root        26     2  0 Apr05 ?        00:00:00 [kacpi_hotplug]
root        27     2  0 Apr05 ?        00:00:00 [ata_aux]
root        28     2  0 Apr05 ?        00:00:00 [ata_sff/0]
root        29     2  0 Apr05 ?        00:00:00 [ata_sff/1]
root        30     2  0 Apr05 ?        00:00:00 [ksuspend_usbd]
root        31     2  0 Apr05 ?        00:00:00 [khubd]
root        32     2  0 Apr05 ?        00:00:00 [kseriod]
root        33     2  0 Apr05 ?        00:00:00 [md/0]
root        34     2  0 Apr05 ?        00:00:00 [md/1]
root        35     2  0 Apr05 ?        00:00:00 [md_misc/0]
root        36     2  0 Apr05 ?        00:00:00 [md_misc/1]
root        37     2  0 Apr05 ?        00:00:00 [linkwatch]
root        39     2  0 Apr05 ?        00:00:03 [khungtaskd]
root        40     2  0 Apr05 ?        00:03:00 [kswapd0]
root        41     2  0 Apr05 ?        00:00:00 [ksmd]
root        42     2  0 Apr05 ?        00:02:23 [khugepaged]
root        43     2  0 Apr05 ?        00:00:00 [aio/0]
root        44     2  0 Apr05 ?        00:00:00 [aio/1]
root        45     2  0 Apr05 ?        00:00:00 [crypto/0]
root        46     2  0 Apr05 ?        00:00:00 [crypto/1]
root        54     2  0 Apr05 ?        00:00:00 [kthrotld/0]
root        55     2  0 Apr05 ?        00:00:00 [kthrotld/1]
root        56     2  0 Apr05 ?        00:00:00 [pciehpd]
root        58     2  0 Apr05 ?        00:00:00 [kpsmoused]
root        59     2  0 Apr05 ?        00:00:00 [usbhid_resumer]
root        60     2  0 Apr05 ?        00:00:00 [deferwq]
root        92     2  0 Apr05 ?        00:00:00 [kdmremove]
root        93     2  0 Apr05 ?        00:00:00 [kstriped]
root       170     2  0 Apr05 ?        00:00:00 [scsi_eh_0]
root       171     2  0 Apr05 ?        00:00:00 [scsi_eh_1]
root       177     2  0 Apr05 ?        00:03:38 [mpt_poll_0]
root       178     2  0 Apr05 ?        00:00:00 [mpt/0]
root       179     2  0 Apr05 ?        00:00:00 [scsi_eh_2]
root       319     2  0 Apr05 ?        00:00:00 [kdmflush]
root       321     2  0 Apr05 ?        00:00:00 [kdmflush]
root       338     2  0 Apr05 ?        00:58:12 [jbd2/dm-0-8]
root       339     2  0 Apr05 ?        00:00:00 [ext4-dio-unwrit]
root       427     1  0 Apr05 ?        00:00:00 /sbin/udevd -d
ntp        630     1  0 Apr14 ?        00:00:27 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
root       640     2  0 Apr05 ?        00:02:55 [vmmemctl]
root       769     2  0 Apr05 ?        00:00:00 [jbd2/sda1-8]
root       770     2  0 Apr05 ?        00:00:00 [ext4-dio-unwrit]
root       807     2  0 Apr05 ?        00:00:22 [kauditd]
root       900     2  0 Apr05 ?        00:15:12 [flush-253:0]
root      1186     1  0 Apr05 ?        01:39:41 /usr/sbin/vmtoolsd
root      1213     1  0 Apr05 ?        00:00:00 /usr/lib/vmware-vgauth/VGAuthService -s
root      1302     1  0 Apr05 ?        00:00:28 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient-eth0.leases -pf /var/run/dhclient-eth0.pid eth0
root      1362     1  0 Apr05 ?        00:01:08 auditd
root      1392     1  0 Apr05 ?        00:00:35 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5
named     1417     1  0 Apr05 ?        00:00:48 /usr/sbin/named -u named
root      1469     1  0 Apr05 ?        00:00:44 /usr/sbin/sshd
root      1513     1  0 Apr05 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql
mysql     1618  1513  0 Apr05 ?        19:01:00 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
root      1711     1  0 Apr05 ?        00:00:50 /usr/libexec/postfix/master
postfix   1721  1711  0 Apr05 ?        00:00:22 qmgr -l -t fifo -u
root      1722     1  0 Apr05 ?        00:06:30 /usr/sbin/httpd
root      1732     1  0 Apr05 ?        00:01:34 crond
nagios    1788     1  0 Apr05 ?        00:47:41 /usr/local/pnp4nagios/bin/npcd -d -f /usr/local/pnp4nagios/etc/npcd.cfg
root      1798     1  0 Apr05 tty2     00:00:00 /sbin/mingetty /dev/tty2
root      1800     1  0 Apr05 tty3     00:00:00 /sbin/mingetty /dev/tty3
root      1802     1  0 Apr05 tty4     00:00:00 /sbin/mingetty /dev/tty4
root      1804     1  0 Apr05 tty5     00:00:00 /sbin/mingetty /dev/tty5
root      1806     1  0 Apr05 tty6     00:00:00 /sbin/mingetty /dev/tty6
root      1813   427  0 Apr05 ?        00:00:00 /sbin/udevd -d
root      1814   427  0 Apr05 ?        00:00:00 /sbin/udevd -d
root      2835     1  0 Apr06 tty1     00:00:00 /sbin/mingetty /dev/tty1
postfix   4826  1711  0 14:45 ?        00:00:00 pickup -l -t fifo -u
apache    4983  1722  0 05:35 ?        00:00:02 /usr/sbin/httpd
apache    5092  1722  0 05:36 ?        00:00:02 /usr/sbin/httpd
apache    5095  1722  0 05:36 ?        00:00:03 /usr/sbin/httpd
apache    5096  1722  0 05:36 ?        00:00:02 /usr/sbin/httpd
apache    5785  1722  0 05:42 ?        00:00:02 /usr/sbin/httpd
nagios    6081     1  0 14:56 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    6083  6081  0 14:56 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    6084  6081  0 14:56 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    6085  6081  0 14:56 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    6086  6081  0 14:56 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    6087  6081  0 14:56 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root      6348 19053  0 14:58 pts/1    00:00:00 vi status.dat
nagios    7518  6084  0 15:07 ?        00:00:00 /usr/local/nagios/libexec/check_ping -H 177.71.17.71 -w 1000.0,80% -c 2000.0,100% -p 5 -4
nagios    7519  7518  0 15:07 ?        00:00:00 /bin/ping -n -U -w 15 -c 5 177.71.17.71
nagios    7534  6086  0 15:07 ?        00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_oid -H <server> -p 1161 -o 1.3.6.1.4.1.42.2.145.3.163.1.1.2.11.0 -C STGen2016
root      7535 29515  0 15:07 pts/0    00:00:00 ps -ef --cols=300
apache    8631  1722  0 06:07 ?        00:00:03 /usr/sbin/httpd
apache    9425  1722  0 10:52 ?        00:00:01 /usr/sbin/httpd
apache   10037  1722  0 06:19 ?        00:00:02 /usr/sbin/httpd
apache   10575  1722  0 06:23 ?        00:00:03 /usr/sbin/httpd
apache   11829  1722  0 11:11 ?        00:00:01 /usr/sbin/httpd
root     19036  1469  0 12:09 ?        00:00:00 sshd: root@pts/1
root     19053 19036  0 12:09 pts/1    00:00:00 -bash
root     23837  1469  0 08:17 ?        00:00:01 sshd: root@pts/0
root     23855 23837  0 08:17 pts/0    00:00:00 -bash
root     29463 23855  0 13:39 pts/0    00:00:00 su nagios
nagios   29464 29463  0 13:39 pts/0    00:00:00 bash
root     29507 29464  0 13:39 pts/0    00:00:00 su root
root     29515 29507  0 13:39 pts/0    00:00:00 bash

Here is the output of the status.dat file for that specific service:
Attachment: nagios-status-dat2.png

Here is another picture of the command executed on the command line:
Attachment: nagios-command-line.png

I'm still puzzled as to why the results are different in the command line and interface. Thanks in advance for the help.
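
One thing that may help when comparing the two: Nagios takes the state from the plugin's exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), not from the text it prints, so it is worth checking `$?` after the manual run. Below is a minimal sketch of that; `state_name` and `run_check` are made-up helper names, and the last line uses a stand-in command rather than the real plugin.

```shell
#!/bin/sh
# Map a standard plugin exit code to the corresponding state name.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "out of range" ;;
    esac
}

# Hypothetical wrapper: run any command, then show its output,
# its exit code and the state that code stands for.
run_check() {
    output=$("$@" 2>&1)
    code=$?
    printf '%s\nexit code: %s (%s)\n' "$output" "$code" "$(state_name "$code")"
}

# Stand-in for a real plugin call; replace it with the actual check command.
run_check sh -c 'echo "example plugin output"; exit 2'
```

Running the real check through `run_check` as the nagios user, with exactly the arguments from the service definition, would show whether the manual run and the daemon are really executing the same thing.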

Re: Status critical

Posted: Mon Sep 18, 2017 12:26 pm
by lmiltchev
Have you tried forcing a check from the web gui by clicking on the "Re-schedule the next check of this service" link under the "Service Commands" window? Did the status change?
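
If the web GUI is not convenient, a forced check can also be requested through the external command file with `SCHEDULE_FORCED_SVC_CHECK`. The following is a sketch under assumptions: `forced_check_line` is a made-up helper, "myhost" is a placeholder host name, and the command file path is a guess based on a default `/usr/local/nagios` install.

```shell
#!/bin/sh
# Build the external-command line that asks Nagios to force a service check now.
# $1 = host name, $2 = service description.
forced_check_line() {
    now=$(date +%s)
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$now" "$1" "$2" "$now"
}

# Assumed location of the command file; override with CMD_FILE=... if different.
CMD_FILE=${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}

# "myhost" is a placeholder; use the names from your own configuration.
line=$(forced_check_line "myhost" "haproxy process")

# Only write if the command file is a writable named pipe.
if [ -p "$CMD_FILE" ] && [ -w "$CMD_FILE" ]; then
    printf '%s\n' "$line" > "$CMD_FILE"
else
    echo "command file not writable, would have sent: $line"
fi
```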

Can you post the config of the "haproxy process" service?

Re: Status critical

Posted: Mon Sep 18, 2017 12:36 pm
by Rovendra
Hi lmiltchev,

I've tried a forced check and nothing changed. I've tried debugging and it's showing a critical status in the logs as well.

Here is the haproxy service configuration file:
Attachment: haproxy_service_config.PNG

Thanks in advance.