Nagios Service status got frozen
Hi,
Our Nagios service status is frozen, but we are still receiving email notifications. We thought it was because of a database crash. To resolve it, we did the following:
1. Tried the DB repair from the Nagios script (repair_databases.sh)
2. Tried the MySQL repair script (repairmysql.sh)
3. Updated Nagios to the latest version.
4. Restarted all services (server)
None of these steps resolved the issue. Please help us fix it.
Thanks
- tacolover101
- Posts: 432
- Joined: Mon Apr 10, 2017 11:55 am
Re: Nagios Service status got frozen
what processes seem to be eating all of your resources? you'll probably want to PM your profile.zip to one of the employees.
for the rest of us, please post the output for these commands:
Code: Select all
top
free -m
df -H
ps aux
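If it is easier, the four commands above can be captured into a single file for posting. This is only a minimal sketch, assuming a POSIX shell and the procps version of top (which accepts -b -n 1 for a one-shot batch run); collect_diag and the diag.txt default are made-up names, not part of Nagios:

```shell
# Hypothetical helper: run each diagnostic command once and append its
# output to one file, with a header line naming the command.
collect_diag() {
    out="${1:-diag.txt}"
    : > "$out"                                  # start with an empty file
    for cmd in "top -b -n 1" "free -m" "df -H" "ps aux"; do
        printf '== %s ==\n' "$cmd" >> "$out"
        # Record a marker instead of aborting if a command is unavailable.
        sh -c "$cmd" >> "$out" 2>&1 || printf '(command failed)\n' >> "$out"
    done
}
```

Calling collect_diag /tmp/diag.txt would leave everything in /tmp/diag.txt, ready to paste or attach.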
Re: Nagios Service status got frozen
Any update, concise? Can you show us the output of the commands, posted by tacolover101, and send us your profile?
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Nagios Service status got frozen
Sorry for the delayed response. Please see below for the output of the commands you asked for.
top
free -m
df -H
ps aux
top
Code: Select all
top - 12:08:50 up 27 days, 11:45, 1 user, load average: 0.23, 0.65, 0.73
Tasks: 129 total, 1 running, 128 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.7 us, 1.0 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1882288 total, 1197156 free, 398552 used, 286580 buff/cache
KiB Swap: 2097148 total, 1926424 free, 170724 used. 1287708 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23378 nagios 20 0 204132 10240 3752 S 0.7 0.5 0:00.02 python
1004 ajaxterm 20 0 181140 1016 288 S 0.3 0.1 8:36.60 python
2478 mysql 20 0 1136788 70624 2828 S 0.3 3.8 25:04.78 mysqld
1 root 20 0 192748 2552 1496 S 0.0 0.1 10:41.22 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.09 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 1:34.48 ksoftirqd/0
7 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root 20 0 0 0 0 S 0.0 0.0 4:31.04 rcu_sched
10 root rt 0 0 0 0 S 0.0 0.0 0:11.65 watchdog/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdevtmpfs
13 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 netns
14 root 20 0 0 0 0 S 0.0 0.0 0:00.01 xenwatch
15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xenbus
17 root 20 0 0 0 0 S 0.0 0.0 0:00.63 khungtaskd
18 root 0 -20 0 0 0 S 0.0 0.0 0:00.63 writeback
19 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kintegrityd
20 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 bioset
21 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kblockd
22 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 md
27 root 20 0 0 0 0 S 0.0 0.0 0:04.64 kswapd0
28 root 25 5 0 0 0 S 0.0 0.0 0:00.00 ksmd
29 root 39 19 0 0 0 S 0.0 0.0 1:30.80 khugepaged
30 root 20 0 0 0 0 S 0.0 0.0 0:00.00 fsnotify_mark
31 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 crypto
39 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kthrotld
41 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kmpath_rdacd
42 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
44 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 ipv6_addrconf
63 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 deferwq
109 root 20 0 0 0 0 S 0.0 0.0 0:09.65 kauditd
181 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 rpciod
Code: Select all
[root@ip-172-31-0-10 centos]# free -m
total used free shared buff/cache available
Mem: 1838 387 1170 27 280 1258
Swap: 2047 166 1881
Code: Select all
[root@ip-172-31-0-10 centos]# df -H
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 108G 9.9G 98G 10% /
devtmpfs 944M 0 944M 0% /dev
tmpfs 964M 0 964M 0% /dev/shm
tmpfs 964M 110M 855M 12% /run
tmpfs 964M 0 964M 0% /sys/fs/cgroup
tmpfs 193M 0 193M 0% /run/user/1001
tmpfs 193M 0 193M 0% /run/user/1000
Code: Select all
[root@ip-172-31-0-10 centos]# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 192748 2552 ? Ss Jul05 10:41 /usr/lib/systemd/systemd --switched-root --system --deserialize 20
root 2 0.0 0.0 0 0 ? S Jul05 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Jul05 1:34 [ksoftirqd/0]
root 7 0.0 0.0 0 0 ? S Jul05 0:00 [migration/0]
root 8 0.0 0.0 0 0 ? S Jul05 0:00 [rcu_bh]
root 9 0.0 0.0 0 0 ? R Jul05 4:31 [rcu_sched]
root 10 0.0 0.0 0 0 ? S Jul05 0:11 [watchdog/0]
root 12 0.0 0.0 0 0 ? S Jul05 0:00 [kdevtmpfs]
root 13 0.0 0.0 0 0 ? S< Jul05 0:00 [netns]
root 14 0.0 0.0 0 0 ? S Jul05 0:00 [xenwatch]
root 15 0.0 0.0 0 0 ? S Jul05 0:00 [xenbus]
root 17 0.0 0.0 0 0 ? S Jul05 0:00 [khungtaskd]
root 18 0.0 0.0 0 0 ? S< Jul05 0:00 [writeback]
root 19 0.0 0.0 0 0 ? S< Jul05 0:00 [kintegrityd]
root 20 0.0 0.0 0 0 ? S< Jul05 0:00 [bioset]
root 21 0.0 0.0 0 0 ? S< Jul05 0:00 [kblockd]
root 22 0.0 0.0 0 0 ? S< Jul05 0:00 [md]
root 27 0.0 0.0 0 0 ? S Jul05 0:04 [kswapd0]
root 28 0.0 0.0 0 0 ? SN Jul05 0:00 [ksmd]
root 29 0.0 0.0 0 0 ? SN Jul05 1:30 [khugepaged]
root 30 0.0 0.0 0 0 ? S Jul05 0:00 [fsnotify_mark]
root 31 0.0 0.0 0 0 ? S< Jul05 0:00 [crypto]
root 39 0.0 0.0 0 0 ? S< Jul05 0:00 [kthrotld]
root 41 0.0 0.0 0 0 ? S< Jul05 0:00 [kmpath_rdacd]
root 42 0.0 0.0 0 0 ? S< Jul05 0:00 [kpsmoused]
root 44 0.0 0.0 0 0 ? S< Jul05 0:00 [ipv6_addrconf]
root 63 0.0 0.0 0 0 ? S< Jul05 0:00 [deferwq]
root 109 0.0 0.0 0 0 ? S Jul05 0:09 [kauditd]
root 181 0.0 0.0 0 0 ? S< Jul05 0:00 [rpciod]
root 242 0.0 0.0 0 0 ? S< Jul05 0:00 [ata_sff]
root 243 0.0 0.0 0 0 ? S Jul05 0:00 [scsi_eh_0]
root 245 0.0 0.0 0 0 ? S< Jul05 0:00 [scsi_tmf_0]
root 246 0.0 0.0 0 0 ? S Jul05 0:00 [scsi_eh_1]
root 248 0.0 0.0 0 0 ? S< Jul05 0:00 [scsi_tmf_1]
root 269 0.0 0.0 0 0 ? S< Jul05 0:00 [xfsalloc]
root 270 0.0 0.0 0 0 ? S< Jul05 0:00 [xfs_mru_cache]
root 271 0.0 0.0 0 0 ? S< Jul05 0:00 [xfs-buf/xvda1]
root 272 0.0 0.0 0 0 ? S< Jul05 0:00 [xfs-data/xvda1]
root 273 0.0 0.0 0 0 ? S< Jul05 0:00 [xfs-conv/xvda1]
root 274 0.0 0.0 0 0 ? S< Jul05 0:00 [xfs-cil/xvda1]
root 275 0.0 0.0 0 0 ? S< Jul05 0:00 [xfs-reclaim/xvd]
root 276 0.0 0.0 0 0 ? S< Jul05 0:00 [xfs-log/xvda1]
root 277 0.0 0.0 0 0 ? S< Jul05 0:00 [xfs-eofblocks/x]
root 278 0.0 0.0 0 0 ? S Jul05 10:30 [xfsaild/xvda1]
root 354 0.0 0.0 39980 1652 ? Ss Jul05 4:00 /usr/lib/systemd/systemd-journald
root 390 0.0 0.0 45584 268 ? Ss Jul05 0:00 /usr/lib/systemd/systemd-udevd
root 423 0.0 0.0 0 0 ? S< Jul05 0:00 [ttm_swap]
root 425 0.0 0.0 57460 556 ? S<sl Jul05 0:53 /sbin/auditd -n
dbus 482 0.0 0.0 26864 1032 ? Ss Jul05 11:08 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
polkitd 484 0.0 0.0 529664 1320 ? Ssl Jul05 4:36 /usr/lib/polkit-1/polkitd --no-debug
root 507 0.0 0.0 26404 1096 ? Ss Jul05 6:17 /usr/lib/systemd/systemd-logind
chrony 512 0.0 0.0 117892 584 ? S Jul05 0:02 /usr/sbin/chronyd
root 513 0.0 0.0 203256 284 ? Ssl Jul05 0:01 /usr/sbin/gssproxy -D
root 551 0.0 0.0 112080 208 ttyS0 Ss+ Jul05 0:00 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
root 703 0.0 0.0 114924 916 ? Ss Jul05 0:01 /sbin/dhclient -H ip-172-31-0-10 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0
root 744 0.0 0.0 385660 1868 ? Ssl Jul05 1:53 /usr/sbin/rsyslogd -n
root 747 0.0 0.0 29148 192 ? Ss Jul05 0:00 /usr/sbin/xinetd -stayalive -pidfile /var/run/xinetd.pid
root 748 0.0 0.0 555192 856 ? Ssl Jul05 3:01 /usr/bin/python -Es /usr/sbin/tuned -l -P
root 763 0.0 0.0 107524 464 ? Ss Jul05 0:02 /usr/sbin/sshd -D
ajaxterm 1004 0.0 0.0 181140 1016 ? Sl Jul05 8:36 python /usr/share/ajaxterm/ajaxterm.py --daemon --port=8022 --uid=ajaxterm
root 1329 0.0 0.0 91080 404 ? Ss Jul05 0:03 /usr/libexec/postfix/master -w
postfix 1335 0.0 0.0 91360 364 ? S Jul05 0:00 qmgr -l -t unix -u
nagios 1678 0.0 0.0 57652 204 ? Ss Jul23 0:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg -f
mysql 2247 0.0 0.0 115300 220 ? Ss Jul23 0:00 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
root 2300 0.0 0.0 112080 212 tty1 Ss+ Jul05 0:00 /sbin/agetty --noclear tty1 linux
mysql 2478 0.1 3.7 1136788 70624 ? Sl Jul23 25:05 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log
root 3894 0.0 0.0 0 0 ? S 11:25 0:00 [kworker/u30:0]
root 3895 0.0 0.2 515360 4752 ? Ss Jul23 0:27 /usr/sbin/httpd -DFOREGROUND
root 4688 0.0 0.0 127876 604 ? Ss Jul05 0:56 /usr/sbin/crond -n
apache 5366 0.0 1.0 628044 20116 ? S 09:00 0:09 /usr/sbin/httpd -DFOREGROUND
apache 5376 0.0 1.0 628536 19444 ? S 09:00 0:09 /usr/sbin/httpd -DFOREGROUND
apache 5378 0.0 1.0 628528 20640 ? S 09:00 0:10 /usr/sbin/httpd -DFOREGROUND
apache 5388 0.0 1.0 628544 20552 ? S 09:00 0:09 /usr/sbin/httpd -DFOREGROUND
apache 5646 0.0 1.1 628536 21068 ? S 09:00 0:09 /usr/sbin/httpd -DFOREGROUND
root 7833 0.0 0.0 0 0 ? S 11:34 0:00 [kworker/0:2]
root 13333 0.0 0.0 0 0 ? S< Jul30 0:00 [kworker/0:1H]
apache 15800 0.2 1.0 627248 19648 ? S 11:52 0:02 /usr/sbin/httpd -DFOREGROUND
nagios 15931 0.1 0.4 27552 7976 ? Ss 11:52 0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 15934 0.0 0.0 12816 1036 ? S 11:52 0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15935 0.0 0.0 12816 1040 ? S 11:52 0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15936 0.0 0.0 12816 1032 ? S 11:52 0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15937 0.0 0.0 12816 1036 ? S 11:52 0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 15952 0.0 0.2 27032 4780 ? S 11:52 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
postfix 16022 0.0 0.2 91184 3984 ? S 10:38 0:00 pickup -l -t unix -u
apache 16209 0.1 1.0 627248 19636 ? S 11:52 0:02 /usr/sbin/httpd -DFOREGROUND
nagios 16447 0.0 0.0 373236 696 ? S Jul23 0:29 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
apache 16763 0.0 1.6 638824 30268 ? S Jul31 1:14 /usr/sbin/httpd -DFOREGROUND
apache 18413 0.1 1.0 627740 20168 ? S 11:57 0:01 /usr/sbin/httpd -DFOREGROUND
root 20260 0.0 0.0 0 0 ? S 12:02 0:00 [kworker/0:1]
apache 20638 0.1 1.0 626732 19112 ? S 12:02 0:00 /usr/sbin/httpd -DFOREGROUND
root 22975 0.0 0.0 0 0 ? R 12:08 0:00 [kworker/0:0]
root 23123 0.0 0.2 149884 5204 ? Ss 12:08 0:00 sshd: centos [priv]
centos 23126 0.0 0.1 149884 2208 ? S 12:08 0:00 sshd: centos@pts/0
centos 23127 0.0 0.1 117428 2004 pts/0 Ss 12:08 0:00 -bash
root 23267 0.0 0.1 195416 2816 pts/0 S 12:08 0:00 sudo su
root 23268 0.0 0.1 189568 2336 pts/0 S 12:08 0:00 su
root 23269 0.0 0.1 117428 2088 pts/0 S 12:08 0:00 bash
root 24722 0.0 0.1 179772 2092 ? S 12:12 0:00 /usr/sbin/CROND -n
root 24723 0.0 0.1 179772 2092 ? S 12:12 0:00 /usr/sbin/CROND -n
root 24724 0.0 0.1 179772 2092 ? S 12:12 0:00 /usr/sbin/CROND -n
root 24725 0.0 0.1 179772 2092 ? S 12:12 0:00 /usr/sbin/CROND -n
root 24726 0.0 0.1 179772 2092 ? S 12:12 0:00 /usr/sbin/CROND -n
root 24727 0.0 0.1 179772 2092 ? S 12:12 0:00 /usr/sbin/CROND -n
nagios 24728 0.0 0.0 115164 1208 ? Ss 12:12 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php >> /usr/local/nagiosxi/var/eventman.log 2>&1
nagios 24730 0.0 0.0 115164 1208 ? Ss 12:12 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php >> /usr/local/nagiosxi/var/event_handler.log 2>&1
nagios 24731 1.5 1.5 443416 30004 ? S 12:12 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
nagios 24734 0.8 1.2 437012 23752 ? S 12:12 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php
nagios 24737 0.0 0.0 115164 1208 ? Ss 12:12 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php >> /usr/local/nagiosxi/var/feedproc.log 2>&1
nagios 24739 0.0 0.0 115164 1208 ? Ss 12:12 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php >> /usr/local/nagiosxi/var/cmdsubsys.log 2>&1
nagios 24741 1.0 1.2 437012 23788 ? S 12:12 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
nagios 24742 0.0 0.0 115164 1208 ? Ss 12:12 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php >> /usr/local/nagiosxi/var/perfdataproc.log 2>&1
nagios 24744 0.0 0.0 115164 1208 ? Ss 12:12 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php >> /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios 24745 0.8 1.2 437268 24136 ? S 12:12 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
nagios 24746 0.8 1.2 437268 23984 ? S 12:12 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
nagios 24748 1.0 1.3 437272 24752 ? S 12:12 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios 24885 0.0 0.1 46296 2988 ? S 12:12 0:00 /usr/local/nagios/libexec/check_nrpe -H vpsp03.concisehosting.com.au -t 30 -c check_yum
nagios 24887 0.0 0.1 46296 2984 ? S 12:12 0:00 /usr/local/nagios/libexec/check_nrpe -H vpsp08.concisehosting.com.au -t 30 -c check_yum
nagios 24893 0.0 0.1 46296 2896 ? S 12:12 0:00 /usr/local/nagios/libexec/check_nrpe -H vpsp04.concisehosting.com.au -t 30 -c check_init_service -a munin-node
nagios 24894 0.0 0.0 122696 976 ? S 12:12 0:00 /usr/local/nagios/libexec/check_icmp -H www.condura.com.au -w 3000.0,80% -c 5000.0,100% -p 5
root 24895 0.0 0.0 153100 1840 pts/0 R+ 12:12 0:00 ps aux
root 28176 0.0 0.0 0 0 ? S< Jul29 0:03 [kworker/0:0H]
root 32483 0.0 0.0 0 0 ? S 06:20 0:00 [kworker/u30:2]
Re: Nagios Service status got frozen
Is the service status still not updating in the XI GUI?
Can you log in to the Nagios Core interface and see if that is updating?
To log in to the Core interface, use this URL, replacing xxx.xxx.xxx.xxx with the IP address of the server.
Code: Select all
http://xxx.xxx.xxx.xxx/nagios/
If the Core interface is not updating, the retention.dat file could be corrupt and will have to be removed from the system.
To remove it, log in as root and run the following commands.
Code: Select all
service nagios stop
killall -9 nagios
mv /usr/local/nagios/var/retention.dat /usr/local/nagios/var/retention.bak
service nagios start
Give the system 10 minutes to run and see if the status updates.
The downside of doing this is that any saved notes and downtime will be removed from the system.
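One caveat with the mv step: it writes to a fixed retention.bak name, so running it a second time replaces the earlier backup. A minimal sketch of a timestamped variant, assuming a POSIX shell; move_aside is a hypothetical helper name, not a Nagios tool:

```shell
# Hypothetical helper: move a file aside under a timestamped name so that
# earlier backups are not overwritten. Prints the new path on success.
move_aside() {
    src="$1"
    [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
    # retention.dat becomes retention.YYYYmmdd-HHMMSS.bak
    dest="${src%.dat}.$(date +%Y%m%d-%H%M%S).bak"
    mv -- "$src" "$dest" && printf '%s\n' "$dest"
}
```

It would take the place of the mv line only, for example move_aside /usr/local/nagios/var/retention.dat, with Nagios still stopped before it and started after it.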
Re: Nagios Service status got frozen
Yes, the service status in the XI GUI is still not updating, but Nagios Core has been working fine.
Thanks
Re: Nagios Service status got frozen
OK, let's run a repair of the MySQL database and restart some of the services to see if the GUI updates.
Log in to the server as root and run the following to do that.
Code: Select all
service nagios stop
service ndo2db stop
mysqlcheck -f -r -u root -pnagiosxi --all-databases
service mysqld restart
service httpd restart
service ndo2db start
service nagios start
Try that and let us know if this fixes the issue.
Re: Nagios Service status got frozen
We had already tried the DB repair steps, but we gave them another try just now. It did not fix the issue. This time I noticed an error; see below.
Note: we are using mariadb instead of mysqld
Code: Select all
nagiosxi.xi_incidents
note : The storage engine for the table doesn't support repair
- tacolover101
- Posts: 432
- Joined: Mon Apr 10, 2017 11:55 am
Re: Nagios Service status got frozen
what command did you run to output that?
generally, running through this doc usually is the catch all cure for repairing databases - https://support.nagios.com/kb/article.php?id=24
if that doesn't have it, then this should - https://assets.nagios.com/downloads/nag ... tabase.pdf
it's hard to say why this happens, but it would be really nice if Nagios built a self-healing / health check method for this.
it could be done through CRON running at night, an admin setting, or even a check within Nagios itself, against itself. (the last wouldn't be the greatest as you leave the potential to auto-sql restart loops though.)
Re: Nagios Service status got frozen
I noticed this while running the command below:
Code: Select all
mysqlcheck -f -r -u root -pnagiosxi --all-databases
It responds with the following:
Code: Select all
nagiosxi.xi_incidents
note : The storage engine for the table doesn't support repair
Let me check the two guides and get back to you.
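For context, mysqlcheck -r typically prints this note for tables whose storage engine has no REPAIR TABLE support (InnoDB, for example), so on its own it may not mean the table is damaged. A minimal sketch for pulling those table names out of a long mysqlcheck run, assuming the output layout shown above (table name on one line, note on the next); unrepairable_tables is a made-up name:

```shell
# Hypothetical filter: read mysqlcheck output on stdin and print each table
# whose note says the storage engine has no repair support.
unrepairable_tables() {
    awk '
        # The note line follows the table line, so print the remembered name.
        /storage engine for the table/ && /support repair/ { print prev; next }
        # Otherwise remember the first field, which is the db.table name.
        { prev = $1 }
    '
}
```

Usage would be along the lines of mysqlcheck -f -r -u root -pnagiosxi --all-databases | unrepairable_tables, which leaves only the tables that need a closer look by other means.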