Status critical
Hi all,
I'm new to Nagios, so I'm sorry if this is a newbie question. I've got Nagios monitoring some services, one of which is an HAProxy service. I had internet issues yesterday and my servers were down for a couple of hours. The problem is that even after everything came back online and working, the service monitor won't recover from the CRITICAL state (HAProxy itself is up and running).
Any suggestions on what is causing this issue?
Thanks!
Attachment: haproxy.PNG
Re: Status critical
Hello,
Can you run the haproxy check command from the command line on the server and post the result?
Thank you!
Re: Status critical
OK guys,
Just to give an update on what has happened since I first had this problem: I was finally able to figure out that the status is CRITICAL because I have more than one haproxy process running on the server. That's why the answer is 2 and not 1, but that should still count as OK. Now I have two more questions. The first is why haproxy is spawning more than one process when it didn't do that before (the same is happening with my Apache: it's spawning 8 processes and it didn't before). The second is why the service shows as CRITICAL in the interface, since the SNMP check I run is:
check_snmp -o 1.3.6.1.4.1.2021.2.1.5.6 -C STGen2016 -r [1-9][0-9]* <server>
and the regex part is supposed to tell Nagios that any nonzero number of processes is normal.
Any suggestions? Thanks in advance.
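For what it's worth, the `-r` option makes check_snmp return OK when the value it retrieves matches the regex. A minimal sketch of how that pattern behaves, with plain grep standing in for check_snmp's matcher and made-up process counts:

```shell
# [1-9][0-9]* matches any string containing a digit 1-9 (optionally
# followed by more digits), so any positive process count should
# satisfy the check, while a count of 0 should not.
for count in 1 2 8 0; do
  if echo "$count" | grep -Eq '[1-9][0-9]*'; then
    echo "count=$count -> matches (OK)"
  else
    echo "count=$count -> no match (CRITICAL)"
  fi
done
```

Run as-is, this prints a match for 1, 2, and 8 and a no-match only for 0, which is consistent with "any nonzero number of processes is normal" -- so the regex itself shouldn't be the reason the interface shows CRITICAL.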
Re: Status critical
OK guys, I've found out more information. Looking at old screenshots of Nagios, I found that 2 haproxy processes and 7 Apache processes are apparently normal. So I'm guessing something changed in Nagios (I don't know how that is possible, since nobody has touched this server) that is now treating multiple processes as a CRITICAL status. The strange fact remains that a status check on the command line returns OK while the interface lists the service as CRITICAL.
Note: we use NagiosQL.
I appreciate any help.
Re: Status critical
Try running these commands to stop and start the Nagios daemon:
Code: Select all
service nagios stop
killall -9 nagios
service nagios start
Log out of the GUI, log back in, and see if the status is updated at the next check.
If not, run the following command and post the output:
Code: Select all
ps -ef --cols=300
Also, open the status.dat file, find that service entry, and post it here as well.
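As an aside, to spot duplicate daemons quickly the same ps output can be filtered; the bracketed first letter is a common trick that keeps grep from matching its own command line (the process names are the ones that come up in this thread):

```shell
# Show only haproxy, apache, and nagios processes from the full listing.
# '[h]aproxy' matches "haproxy" but not the literal string "[h]aproxy"
# that appears in grep's own /proc entry.
ps -ef --cols=300 | grep -E '[h]aproxy|[a]pache|[n]agios'
```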
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Status critical
This is the output of the 'ps -ef --cols=300' command:
Here is the output of the status.dat file for that specific service:
Here is another picture of the command executed on the command line:
I'm still puzzled as to why the results are different in the command line and interface. Thanks in advance for the help.
Code: Select all
[root@help /]# ps -ef --cols=300
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Apr05 ? 00:00:52 /sbin/init
root 2 0 0 Apr05 ? 00:00:00 [kthreadd]
root 3 2 0 Apr05 ? 00:01:31 [migration/0]
root 4 2 0 Apr05 ? 00:01:17 [ksoftirqd/0]
root 5 2 0 Apr05 ? 00:00:00 [stopper/0]
root 6 2 0 Apr05 ? 00:00:11 [watchdog/0]
root 7 2 0 Apr05 ? 00:00:49 [migration/1]
root 8 2 0 Apr05 ? 00:00:00 [stopper/1]
root 9 2 0 Apr05 ? 00:00:25 [ksoftirqd/1]
root 10 2 0 Apr05 ? 00:00:09 [watchdog/1]
root 11 2 0 Apr05 ? 00:07:01 [events/0]
root 12 2 0 Apr05 ? 01:03:40 [events/1]
root 13 2 0 Apr05 ? 00:00:00 [cgroup]
root 14 2 0 Apr05 ? 00:00:00 [khelper]
root 15 2 0 Apr05 ? 00:00:00 [netns]
root 16 2 0 Apr05 ? 00:00:00 [async/mgr]
root 17 2 0 Apr05 ? 00:00:00 [pm]
root 18 2 0 Apr05 ? 00:00:27 [sync_supers]
root 19 2 0 Apr05 ? 00:00:41 [bdi-default]
root 20 2 0 Apr05 ? 00:00:00 [kintegrityd/0]
root 21 2 0 Apr05 ? 00:00:00 [kintegrityd/1]
root 22 2 0 Apr05 ? 00:09:28 [kblockd/0]
root 23 2 0 Apr05 ? 00:09:50 [kblockd/1]
root 24 2 0 Apr05 ? 00:00:00 [kacpid]
root 25 2 0 Apr05 ? 00:00:00 [kacpi_notify]
root 26 2 0 Apr05 ? 00:00:00 [kacpi_hotplug]
root 27 2 0 Apr05 ? 00:00:00 [ata_aux]
root 28 2 0 Apr05 ? 00:00:00 [ata_sff/0]
root 29 2 0 Apr05 ? 00:00:00 [ata_sff/1]
root 30 2 0 Apr05 ? 00:00:00 [ksuspend_usbd]
root 31 2 0 Apr05 ? 00:00:00 [khubd]
root 32 2 0 Apr05 ? 00:00:00 [kseriod]
root 33 2 0 Apr05 ? 00:00:00 [md/0]
root 34 2 0 Apr05 ? 00:00:00 [md/1]
root 35 2 0 Apr05 ? 00:00:00 [md_misc/0]
root 36 2 0 Apr05 ? 00:00:00 [md_misc/1]
root 37 2 0 Apr05 ? 00:00:00 [linkwatch]
root 39 2 0 Apr05 ? 00:00:03 [khungtaskd]
root 40 2 0 Apr05 ? 00:03:00 [kswapd0]
root 41 2 0 Apr05 ? 00:00:00 [ksmd]
root 42 2 0 Apr05 ? 00:02:23 [khugepaged]
root 43 2 0 Apr05 ? 00:00:00 [aio/0]
root 44 2 0 Apr05 ? 00:00:00 [aio/1]
root 45 2 0 Apr05 ? 00:00:00 [crypto/0]
root 46 2 0 Apr05 ? 00:00:00 [crypto/1]
root 54 2 0 Apr05 ? 00:00:00 [kthrotld/0]
root 55 2 0 Apr05 ? 00:00:00 [kthrotld/1]
root 56 2 0 Apr05 ? 00:00:00 [pciehpd]
root 58 2 0 Apr05 ? 00:00:00 [kpsmoused]
root 59 2 0 Apr05 ? 00:00:00 [usbhid_resumer]
root 60 2 0 Apr05 ? 00:00:00 [deferwq]
root 92 2 0 Apr05 ? 00:00:00 [kdmremove]
root 93 2 0 Apr05 ? 00:00:00 [kstriped]
root 170 2 0 Apr05 ? 00:00:00 [scsi_eh_0]
root 171 2 0 Apr05 ? 00:00:00 [scsi_eh_1]
root 177 2 0 Apr05 ? 00:03:38 [mpt_poll_0]
root 178 2 0 Apr05 ? 00:00:00 [mpt/0]
root 179 2 0 Apr05 ? 00:00:00 [scsi_eh_2]
root 319 2 0 Apr05 ? 00:00:00 [kdmflush]
root 321 2 0 Apr05 ? 00:00:00 [kdmflush]
root 338 2 0 Apr05 ? 00:58:12 [jbd2/dm-0-8]
root 339 2 0 Apr05 ? 00:00:00 [ext4-dio-unwrit]
root 427 1 0 Apr05 ? 00:00:00 /sbin/udevd -d
ntp 630 1 0 Apr14 ? 00:00:27 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
root 640 2 0 Apr05 ? 00:02:55 [vmmemctl]
root 769 2 0 Apr05 ? 00:00:00 [jbd2/sda1-8]
root 770 2 0 Apr05 ? 00:00:00 [ext4-dio-unwrit]
root 807 2 0 Apr05 ? 00:00:22 [kauditd]
root 900 2 0 Apr05 ? 00:15:12 [flush-253:0]
root 1186 1 0 Apr05 ? 01:39:41 /usr/sbin/vmtoolsd
root 1213 1 0 Apr05 ? 00:00:00 /usr/lib/vmware-vgauth/VGAuthService -s
root 1302 1 0 Apr05 ? 00:00:28 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient-eth0.leases -pf /var/run/dhclient-eth0.pid eth0
root 1362 1 0 Apr05 ? 00:01:08 auditd
root 1392 1 0 Apr05 ? 00:00:35 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5
named 1417 1 0 Apr05 ? 00:00:48 /usr/sbin/named -u named
root 1469 1 0 Apr05 ? 00:00:44 /usr/sbin/sshd
root 1513 1 0 Apr05 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql
mysql 1618 1513 0 Apr05 ? 19:01:00 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
root 1711 1 0 Apr05 ? 00:00:50 /usr/libexec/postfix/master
postfix 1721 1711 0 Apr05 ? 00:00:22 qmgr -l -t fifo -u
root 1722 1 0 Apr05 ? 00:06:30 /usr/sbin/httpd
root 1732 1 0 Apr05 ? 00:01:34 crond
nagios 1788 1 0 Apr05 ? 00:47:41 /usr/local/pnp4nagios/bin/npcd -d -f /usr/local/pnp4nagios/etc/npcd.cfg
root 1798 1 0 Apr05 tty2 00:00:00 /sbin/mingetty /dev/tty2
root 1800 1 0 Apr05 tty3 00:00:00 /sbin/mingetty /dev/tty3
root 1802 1 0 Apr05 tty4 00:00:00 /sbin/mingetty /dev/tty4
root 1804 1 0 Apr05 tty5 00:00:00 /sbin/mingetty /dev/tty5
root 1806 1 0 Apr05 tty6 00:00:00 /sbin/mingetty /dev/tty6
root 1813 427 0 Apr05 ? 00:00:00 /sbin/udevd -d
root 1814 427 0 Apr05 ? 00:00:00 /sbin/udevd -d
root 2835 1 0 Apr06 tty1 00:00:00 /sbin/mingetty /dev/tty1
postfix 4826 1711 0 14:45 ? 00:00:00 pickup -l -t fifo -u
apache 4983 1722 0 05:35 ? 00:00:02 /usr/sbin/httpd
apache 5092 1722 0 05:36 ? 00:00:02 /usr/sbin/httpd
apache 5095 1722 0 05:36 ? 00:00:03 /usr/sbin/httpd
apache 5096 1722 0 05:36 ? 00:00:02 /usr/sbin/httpd
apache 5785 1722 0 05:42 ? 00:00:02 /usr/sbin/httpd
nagios 6081 1 0 14:56 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 6083 6081 0 14:56 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 6084 6081 0 14:56 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 6085 6081 0 14:56 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 6086 6081 0 14:56 ? 00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 6087 6081 0 14:56 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 6348 19053 0 14:58 pts/1 00:00:00 vi status.dat
nagios 7518 6084 0 15:07 ? 00:00:00 /usr/local/nagios/libexec/check_ping -H 177.71.17.71 -w 1000.0,80% -c 2000.0,100% -p 5 -4
nagios 7519 7518 0 15:07 ? 00:00:00 /bin/ping -n -U -w 15 -c 5 177.71.17.71
nagios 7534 6086 0 15:07 ? 00:00:00 /usr/bin/perl -w /usr/local/nagios/libexec/check_snmp_oid -H <server> -p 1161 -o 1.3.6.1.4.1.42.2.145.3.163.1.1.2.11.0 -C STGen2016
root 7535 29515 0 15:07 pts/0 00:00:00 ps -ef --cols=300
apache 8631 1722 0 06:07 ? 00:00:03 /usr/sbin/httpd
apache 9425 1722 0 10:52 ? 00:00:01 /usr/sbin/httpd
apache 10037 1722 0 06:19 ? 00:00:02 /usr/sbin/httpd
apache 10575 1722 0 06:23 ? 00:00:03 /usr/sbin/httpd
apache 11829 1722 0 11:11 ? 00:00:01 /usr/sbin/httpd
root 19036 1469 0 12:09 ? 00:00:00 sshd: root@pts/1
root 19053 19036 0 12:09 pts/1 00:00:00 -bash
root 23837 1469 0 08:17 ? 00:00:01 sshd: root@pts/0
root 23855 23837 0 08:17 pts/0 00:00:00 -bash
root 29463 23855 0 13:39 pts/0 00:00:00 su nagios
nagios 29464 29463 0 13:39 pts/0 00:00:00 bash
root 29507 29464 0 13:39 pts/0 00:00:00 su root
root 29515 29507 0 13:39 pts/0 00:00:00 bash
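One thing that can explain "OK on the command line but CRITICAL in the interface" is that Nagios goes by the plugin's exit code, not by its text output. A quick sketch of the mapping (the `run_check` helper is purely illustrative, not part of Nagios):

```shell
# Nagios maps a plugin's exit code to the service state:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, anything else = UNKNOWN.
run_check() {
  "$@" > /dev/null 2>&1
  case $? in
    0) echo OK ;;
    1) echo WARNING ;;
    2) echo CRITICAL ;;
    *) echo UNKNOWN ;;
  esac
}
run_check true              # prints OK
run_check false             # prints WARNING
run_check sh -c 'exit 2'    # prints CRITICAL
```

So it may be worth running the exact check command as the nagios user (e.g. via `su - nagios`) and then checking `echo $?` immediately afterwards: a plugin can print OK-looking text while still exiting nonzero, for example because of different permissions, PATH, or SNMP settings when run under the nagios account.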
Re: Status critical
Have you tried forcing a check from the web GUI by clicking the "Re-schedule the next check of this service" link under the "Service Commands" window? Did the status change?
Can you post the config of the "haproxy process" service?
Re: Status critical
Hi lmiltchev,
I've tried a forced check and nothing changed. I've also tried debugging, and it shows CRITICAL status in the logs as well.
Here is the haproxy service configuration file:
Thanks in advance.