Nagios Service status got frozen

Post by **concise** » Sat Jul 22, 2017 6:11 am

Hi,

Our Nagios service status is frozen, but we are still receiving email notifications. We thought it was because of a DB crash. To resolve it, we did the following:
1. Tried the DB repair from the Nagios script (repair_databases.sh)
2. Tried the MySQL repair script (repairmysql.sh)
3. Updated Nagios to the latest version.
4. Restarted all services (server)

None of these steps resolved the issue. Please help us fix it.

Thanks

Post by **tacolover101** » Sun Jul 23, 2017 11:02 pm

what processes seem to be eating all of your resources? you'll probably want to PM one of the employees your profile.zip.

for the rest of us, please post the output for these commands:

top
free -m
df -H
ps aux
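
If it helps, the four commands above can be captured into one file for attaching to a reply. This is only a sketch: the function name `collect_diagnostics` and the default path `/tmp/nagios-diagnostics.txt` are made up for illustration, and `top` is run with `-b -n 1` (batch mode, one iteration) so that it writes plain text instead of opening the interactive screen.

```shell
# Collect the requested diagnostics into a single text file.
# The output path is a hypothetical default; pass another path as $1.
collect_diagnostics() {
    out="${1:-/tmp/nagios-diagnostics.txt}"
    : > "$out"    # create or truncate the output file
    for cmd in "top -b -n 1" "free -m" "df -H" "ps aux"; do
        printf '===== %s =====\n' "$cmd" >> "$out"
        # $cmd is intentionally unquoted so it splits into command + arguments
        $cmd >> "$out" 2>&1 || printf '(command failed or is not installed)\n' >> "$out"
    done
    printf '%s\n' "$out"    # report where the output was written
}
```

Calling `collect_diagnostics` with no argument writes to the default path and prints it; a missing tool is noted in the file rather than stopping the run.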

Post by **lmiltchev** » Mon Jul 24, 2017 9:25 am

Any update, concise? Can you show us the output of the commands posted by tacolover101 and send us your profile?

Post by **concise** » Mon Jul 31, 2017 11:15 pm

Sorry for the delayed response. Please find the output of the requested commands below.

top

top - 12:08:50 up 27 days, 11:45,  1 user,  load average: 0.23, 0.65, 0.73
Tasks: 129 total,   1 running, 128 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.7 us,  1.0 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1882288 total,  1197156 free,   398552 used,   286580 buff/cache
KiB Swap:  2097148 total,  1926424 free,   170724 used.  1287708 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
23378 nagios    20   0  204132  10240   3752 S  0.7  0.5   0:00.02 python
 1004 ajaxterm  20   0  181140   1016    288 S  0.3  0.1   8:36.60 python
 2478 mysql     20   0 1136788  70624   2828 S  0.3  3.8  25:04.78 mysqld
    1 root      20   0  192748   2552   1496 S  0.0  0.1  10:41.22 systemd
    2 root      20   0       0      0      0 S  0.0  0.0   0:00.09 kthreadd
    3 root      20   0       0      0      0 S  0.0  0.0   1:34.48 ksoftirqd/0
    7 root      rt   0       0      0      0 S  0.0  0.0   0:00.00 migration/0
    8 root      20   0       0      0      0 S  0.0  0.0   0:00.00 rcu_bh
    9 root      20   0       0      0      0 S  0.0  0.0   4:31.04 rcu_sched
   10 root      rt   0       0      0      0 S  0.0  0.0   0:11.65 watchdog/0
   12 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kdevtmpfs
   13 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 netns
   14 root      20   0       0      0      0 S  0.0  0.0   0:00.01 xenwatch
   15 root      20   0       0      0      0 S  0.0  0.0   0:00.00 xenbus
   17 root      20   0       0      0      0 S  0.0  0.0   0:00.63 khungtaskd
   18 root       0 -20       0      0      0 S  0.0  0.0   0:00.63 writeback
   19 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kintegrityd
   20 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 bioset
   21 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kblockd
   22 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 md
   27 root      20   0       0      0      0 S  0.0  0.0   0:04.64 kswapd0
   28 root      25   5       0      0      0 S  0.0  0.0   0:00.00 ksmd
   29 root      39  19       0      0      0 S  0.0  0.0   1:30.80 khugepaged
   30 root      20   0       0      0      0 S  0.0  0.0   0:00.00 fsnotify_mark
   31 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 crypto
   39 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kthrotld
   41 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kmpath_rdacd
   42 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kpsmoused
   44 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 ipv6_addrconf
   63 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 deferwq
  109 root      20   0       0      0      0 S  0.0  0.0   0:09.65 kauditd
  181 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 rpciod

free -m

[root@ip-172-31-0-10 centos]# free -m
              total        used        free      shared  buff/cache   available
Mem:           1838         387        1170          27         280        1258
Swap:          2047         166        1881

df -H

[root@ip-172-31-0-10 centos]# df -H
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      108G  9.9G   98G  10% /
devtmpfs        944M     0  944M   0% /dev
tmpfs           964M     0  964M   0% /dev/shm
tmpfs           964M  110M  855M  12% /run
tmpfs           964M     0  964M   0% /sys/fs/cgroup
tmpfs           193M     0  193M   0% /run/user/1001
tmpfs           193M     0  193M   0% /run/user/1000

ps aux

[root@ip-172-31-0-10 centos]# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1 192748  2552 ?        Ss   Jul05  10:41 /usr/lib/systemd/systemd --switched-root --system --deserialize 20
root         2  0.0  0.0      0     0 ?        S    Jul05   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Jul05   1:34 [ksoftirqd/0]
root         7  0.0  0.0      0     0 ?        S    Jul05   0:00 [migration/0]
root         8  0.0  0.0      0     0 ?        S    Jul05   0:00 [rcu_bh]
root         9  0.0  0.0      0     0 ?        R    Jul05   4:31 [rcu_sched]
root        10  0.0  0.0      0     0 ?        S    Jul05   0:11 [watchdog/0]
root        12  0.0  0.0      0     0 ?        S    Jul05   0:00 [kdevtmpfs]
root        13  0.0  0.0      0     0 ?        S<   Jul05   0:00 [netns]
root        14  0.0  0.0      0     0 ?        S    Jul05   0:00 [xenwatch]
root        15  0.0  0.0      0     0 ?        S    Jul05   0:00 [xenbus]
root        17  0.0  0.0      0     0 ?        S    Jul05   0:00 [khungtaskd]
root        18  0.0  0.0      0     0 ?        S<   Jul05   0:00 [writeback]
root        19  0.0  0.0      0     0 ?        S<   Jul05   0:00 [kintegrityd]
root        20  0.0  0.0      0     0 ?        S<   Jul05   0:00 [bioset]
root        21  0.0  0.0      0     0 ?        S<   Jul05   0:00 [kblockd]
root        22  0.0  0.0      0     0 ?        S<   Jul05   0:00 [md]
root        27  0.0  0.0      0     0 ?        S    Jul05   0:04 [kswapd0]
root        28  0.0  0.0      0     0 ?        SN   Jul05   0:00 [ksmd]
root        29  0.0  0.0      0     0 ?        SN   Jul05   1:30 [khugepaged]
root        30  0.0  0.0      0     0 ?        S    Jul05   0:00 [fsnotify_mark]
root        31  0.0  0.0      0     0 ?        S<   Jul05   0:00 [crypto]
root        39  0.0  0.0      0     0 ?        S<   Jul05   0:00 [kthrotld]
root        41  0.0  0.0      0     0 ?        S<   Jul05   0:00 [kmpath_rdacd]
root        42  0.0  0.0      0     0 ?        S<   Jul05   0:00 [kpsmoused]
root        44  0.0  0.0      0     0 ?        S<   Jul05   0:00 [ipv6_addrconf]
root        63  0.0  0.0      0     0 ?        S<   Jul05   0:00 [deferwq]
root       109  0.0  0.0      0     0 ?        S    Jul05   0:09 [kauditd]
root       181  0.0  0.0      0     0 ?        S<   Jul05   0:00 [rpciod]
root       242  0.0  0.0      0     0 ?        S<   Jul05   0:00 [ata_sff]
root       243  0.0  0.0      0     0 ?        S    Jul05   0:00 [scsi_eh_0]
root       245  0.0  0.0      0     0 ?        S<   Jul05   0:00 [scsi_tmf_0]
root       246  0.0  0.0      0     0 ?        S    Jul05   0:00 [scsi_eh_1]
root       248  0.0  0.0      0     0 ?        S<   Jul05   0:00 [scsi_tmf_1]
root       269  0.0  0.0      0     0 ?        S<   Jul05   0:00 [xfsalloc]
root       270  0.0  0.0      0     0 ?        S<   Jul05   0:00 [xfs_mru_cache]
root       271  0.0  0.0      0     0 ?        S<   Jul05   0:00 [xfs-buf/xvda1]
root       272  0.0  0.0      0     0 ?        S<   Jul05   0:00 [xfs-data/xvda1]
root       273  0.0  0.0      0     0 ?        S<   Jul05   0:00 [xfs-conv/xvda1]
root       274  0.0  0.0      0     0 ?        S<   Jul05   0:00 [xfs-cil/xvda1]
root       275  0.0  0.0      0     0 ?        S<   Jul05   0:00 [xfs-reclaim/xvd]
root       276  0.0  0.0      0     0 ?        S<   Jul05   0:00 [xfs-log/xvda1]
root       277  0.0  0.0      0     0 ?        S<   Jul05   0:00 [xfs-eofblocks/x]
root       278  0.0  0.0      0     0 ?        S    Jul05  10:30 [xfsaild/xvda1]
root       354  0.0  0.0  39980  1652 ?        Ss   Jul05   4:00 /usr/lib/systemd/systemd-journald
root       390  0.0  0.0  45584   268 ?        Ss   Jul05   0:00 /usr/lib/systemd/systemd-udevd
root       423  0.0  0.0      0     0 ?        S<   Jul05   0:00 [ttm_swap]
root       425  0.0  0.0  57460   556 ?        S<sl Jul05   0:53 /sbin/auditd -n
dbus       482  0.0  0.0  26864  1032 ?        Ss   Jul05  11:08 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
polkitd    484  0.0  0.0 529664  1320 ?        Ssl  Jul05   4:36 /usr/lib/polkit-1/polkitd --no-debug
root       507  0.0  0.0  26404  1096 ?        Ss   Jul05   6:17 /usr/lib/systemd/systemd-logind
chrony     512  0.0  0.0 117892   584 ?        S    Jul05   0:02 /usr/sbin/chronyd
root       513  0.0  0.0 203256   284 ?        Ssl  Jul05   0:01 /usr/sbin/gssproxy -D
root       551  0.0  0.0 112080   208 ttyS0    Ss+  Jul05   0:00 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
root       703  0.0  0.0 114924   916 ?        Ss   Jul05   0:01 /sbin/dhclient -H ip-172-31-0-10 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0
root       744  0.0  0.0 385660  1868 ?        Ssl  Jul05   1:53 /usr/sbin/rsyslogd -n
root       747  0.0  0.0  29148   192 ?        Ss   Jul05   0:00 /usr/sbin/xinetd -stayalive -pidfile /var/run/xinetd.pid
root       748  0.0  0.0 555192   856 ?        Ssl  Jul05   3:01 /usr/bin/python -Es /usr/sbin/tuned -l -P
root       763  0.0  0.0 107524   464 ?        Ss   Jul05   0:02 /usr/sbin/sshd -D
ajaxterm  1004  0.0  0.0 181140  1016 ?        Sl   Jul05   8:36 python /usr/share/ajaxterm/ajaxterm.py --daemon --port=8022 --uid=ajaxterm
root      1329  0.0  0.0  91080   404 ?        Ss   Jul05   0:03 /usr/libexec/postfix/master -w
postfix   1335  0.0  0.0  91360   364 ?        S    Jul05   0:00 qmgr -l -t unix -u
nagios    1678  0.0  0.0  57652   204 ?        Ss   Jul23   0:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg -f
mysql     2247  0.0  0.0 115300   220 ?        Ss   Jul23   0:00 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
root      2300  0.0  0.0 112080   212 tty1     Ss+  Jul05   0:00 /sbin/agetty --noclear tty1 linux
mysql     2478  0.1  3.7 1136788 70624 ?       Sl   Jul23  25:05 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log
root      3894  0.0  0.0      0     0 ?        S    11:25   0:00 [kworker/u30:0]
root      3895  0.0  0.2 515360  4752 ?        Ss   Jul23   0:27 /usr/sbin/httpd -DFOREGROUND
root      4688  0.0  0.0 127876   604 ?        Ss   Jul05   0:56 /usr/sbin/crond -n
apache    5366  0.0  1.0 628044 20116 ?        S    09:00   0:09 /usr/sbin/httpd -DFOREGROUND
apache    5376  0.0  1.0 628536 19444 ?        S    09:00   0:09 /usr/sbin/httpd -DFOREGROUND
apache    5378  0.0  1.0 628528 20640 ?        S    09:00   0:10 /usr/sbin/httpd -DFOREGROUND
apache    5388  0.0  1.0 628544 20552 ?        S    09:00   0:09 /usr/sbin/httpd -DFOREGROUND
apache    5646  0.0  1.1 628536 21068 ?        S    09:00   0:09 /usr/sbin/httpd -DFOREGROUND
root      7833  0.0  0.0      0     0 ?        S    11:34   0:00 [kworker/0:2]
root     13333  0.0  0.0      0     0 ?        S<   Jul30   0:00 [kworker/0:1H]
apache   15800  0.2  1.0 627248 19648 ?        S    11:52   0:02 /usr/sbin/httpd -DFOREGROUND
nagios   15931  0.1  0.4  27552  7976 ?        Ss   11:52   0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   15934  0.0  0.0  12816  1036 ?        S    11:52   0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15935  0.0  0.0  12816  1040 ?        S    11:52   0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15936  0.0  0.0  12816  1032 ?        S    11:52   0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15937  0.0  0.0  12816  1036 ?        S    11:52   0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   15952  0.0  0.2  27032  4780 ?        S    11:52   0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
postfix  16022  0.0  0.2  91184  3984 ?        S    10:38   0:00 pickup -l -t unix -u
apache   16209  0.1  1.0 627248 19636 ?        S    11:52   0:02 /usr/sbin/httpd -DFOREGROUND
nagios   16447  0.0  0.0 373236   696 ?        S    Jul23   0:29 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
apache   16763  0.0  1.6 638824 30268 ?        S    Jul31   1:14 /usr/sbin/httpd -DFOREGROUND
apache   18413  0.1  1.0 627740 20168 ?        S    11:57   0:01 /usr/sbin/httpd -DFOREGROUND
root     20260  0.0  0.0      0     0 ?        S    12:02   0:00 [kworker/0:1]
apache   20638  0.1  1.0 626732 19112 ?        S    12:02   0:00 /usr/sbin/httpd -DFOREGROUND
root     22975  0.0  0.0      0     0 ?        R    12:08   0:00 [kworker/0:0]
root     23123  0.0  0.2 149884  5204 ?        Ss   12:08   0:00 sshd: centos [priv]
centos   23126  0.0  0.1 149884  2208 ?        S    12:08   0:00 sshd: centos@pts/0
centos   23127  0.0  0.1 117428  2004 pts/0    Ss   12:08   0:00 -bash
root     23267  0.0  0.1 195416  2816 pts/0    S    12:08   0:00 sudo su
root     23268  0.0  0.1 189568  2336 pts/0    S    12:08   0:00 su
root     23269  0.0  0.1 117428  2088 pts/0    S    12:08   0:00 bash
root     24722  0.0  0.1 179772  2092 ?        S    12:12   0:00 /usr/sbin/CROND -n
root     24723  0.0  0.1 179772  2092 ?        S    12:12   0:00 /usr/sbin/CROND -n
root     24724  0.0  0.1 179772  2092 ?        S    12:12   0:00 /usr/sbin/CROND -n
root     24725  0.0  0.1 179772  2092 ?        S    12:12   0:00 /usr/sbin/CROND -n
root     24726  0.0  0.1 179772  2092 ?        S    12:12   0:00 /usr/sbin/CROND -n
root     24727  0.0  0.1 179772  2092 ?        S    12:12   0:00 /usr/sbin/CROND -n
nagios   24728  0.0  0.0 115164  1208 ?        Ss   12:12   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php >> /usr/local/nagiosxi/var/eventman.log 2>&1
nagios   24730  0.0  0.0 115164  1208 ?        Ss   12:12   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php >> /usr/local/nagiosxi/var/event_handler.log 2>&1
nagios   24731  1.5  1.5 443416 30004 ?        S    12:12   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php
nagios   24734  0.8  1.2 437012 23752 ?        S    12:12   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/event_handler.php
nagios   24737  0.0  0.0 115164  1208 ?        Ss   12:12   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php >> /usr/local/nagiosxi/var/feedproc.log 2>&1
nagios   24739  0.0  0.0 115164  1208 ?        Ss   12:12   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php >> /usr/local/nagiosxi/var/cmdsubsys.log 2>&1
nagios   24741  1.0  1.2 437012 23788 ?        S    12:12   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
nagios   24742  0.0  0.0 115164  1208 ?        Ss   12:12   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php >> /usr/local/nagiosxi/var/perfdataproc.log 2>&1
nagios   24744  0.0  0.0 115164  1208 ?        Ss   12:12   0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php >> /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios   24745  0.8  1.2 437268 24136 ?        S    12:12   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
nagios   24746  0.8  1.2 437268 23984 ?        S    12:12   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
nagios   24748  1.0  1.3 437272 24752 ?        S    12:12   0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios   24885  0.0  0.1  46296  2988 ?        S    12:12   0:00 /usr/local/nagios/libexec/check_nrpe -H vpsp03.concisehosting.com.au -t 30 -c check_yum
nagios   24887  0.0  0.1  46296  2984 ?        S    12:12   0:00 /usr/local/nagios/libexec/check_nrpe -H vpsp08.concisehosting.com.au -t 30 -c check_yum
nagios   24893  0.0  0.1  46296  2896 ?        S    12:12   0:00 /usr/local/nagios/libexec/check_nrpe -H vpsp04.concisehosting.com.au -t 30 -c check_init_service -a munin-node
nagios   24894  0.0  0.0 122696   976 ?        S    12:12   0:00 /usr/local/nagios/libexec/check_icmp -H www.condura.com.au -w 3000.0,80% -c 5000.0,100% -p 5
root     24895  0.0  0.0 153100  1840 pts/0    R+   12:12   0:00 ps aux
root     28176  0.0  0.0      0     0 ?        S<   Jul29   0:03 [kworker/0:0H]
root     32483  0.0  0.0      0     0 ?        S    06:20   0:00 [kworker/u30:2]

Post by **tgriep** » Tue Aug 01, 2017 3:18 pm

Is the service status still not updating in the XI GUI?

Can you log in to the Nagios Core interface and see if that is updating?

To log in to the Core interface, use this URL, replacing xxx.xxx.xxx.xxx with the IP address of the server.

http://xxx.xxx.xxx.xxx/nagios/

If the Core interface is not updating, the cause could be a corrupt retention.dat file, which will have to be removed from the system.
To remove it, log in as root and run the following commands.

service nagios stop
killall -9 nagios
mv /usr/local/nagios/var/retention.dat /usr/local/nagios/var/retention.bak
service nagios start

Give the system 10 minutes to run and see if the status updates.

The downside of doing this is that any saved notes and scheduled downtime will be removed from the system.
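
Since moving retention.dat aside discards saved state, it may be worth keeping each backup under its own name so it can be restored later. A minimal sketch, assuming the default path from the commands above; the function name `backup_retention` and the timestamped `.bak` naming are illustrative choices, not something Nagios itself provides.

```shell
# Move a Nagios retention file aside under a timestamped name so an
# earlier backup is never overwritten. Run as root with Nagios stopped.
backup_retention() {
    src="${1:-/usr/local/nagios/var/retention.dat}"
    if [ ! -f "$src" ]; then
        echo "no retention file at $src" >&2
        return 1
    fi
    # e.g. retention.dat -> retention.20170801-151800.bak
    dest="${src%.dat}.$(date +%Y%m%d-%H%M%S).bak"
    mv -- "$src" "$dest" && printf '%s\n' "$dest"
}
```

It would be run between `service nagios stop` and `service nagios start`. To undo the change, stop Nagios again and move the printed backup file back to retention.dat.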

Post by **concise** » Tue Aug 01, 2017 10:32 pm

Yes, the service status in the XI GUI is still not updating, but Nagios Core has been working fine.

Thanks

Post by **tgriep** » Wed Aug 02, 2017 11:51 am

OK, let's run a repair of the MySQL databases and restart some of the services to see if the GUI updates.
Log in to the server as root and run the following to do that.

service nagios stop
service ndo2db stop
mysqlcheck -f -r -u root -pnagiosxi --all-databases
service mysqld restart
service httpd restart
service ndo2db start
service nagios start

Try that and let us know if this fixes the issue.
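
For anyone who wants to review the sequence above before running it, here is a rough sketch that prints each command by default and only executes when `DRY_RUN=0` is set. The names `run_step`, `repair_sequence`, `DRY_RUN` and `DB_SERVICE` are assumptions for illustration. `DB_SERVICE` defaults to `mysqld` as in the commands above and can be set to `mariadb` on systems where the database service has that name.

```shell
# Print the command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Same steps, in the same order, as the commands posted above.
repair_sequence() {
    db_service="${DB_SERVICE:-mysqld}"    # assumption: use "mariadb" where applicable
    run_step service nagios stop
    run_step service ndo2db stop
    run_step mysqlcheck -f -r -u root -pnagiosxi --all-databases
    run_step service "$db_service" restart
    run_step service httpd restart
    run_step service ndo2db start
    run_step service nagios start
}
```

Running `DB_SERVICE=mariadb repair_sequence` only prints the seven commands; `DRY_RUN=0 DB_SERVICE=mariadb repair_sequence` as root actually executes them.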

Post by **concise** » Wed Aug 02, 2017 8:03 pm

We had already tried the DB repair steps, but we gave them another try just now. It did not fix the issue. This time I noticed the following message:

nagiosxi.xi_incidents
note     : The storage engine for the table doesn't support repair

Note: using mariadb instead of mysqld

Post by **tacolover101** » Thu Aug 03, 2017 1:12 am

what command did you run to output that?

running through this doc is usually the catch-all cure for repairing databases - https://support.nagios.com/kb/article.php?id=24

if that doesn't have it, then this should - https://assets.nagios.com/downloads/nag ... tabase.pdf

it's hard to say why this happens, but it would be really nice if Nagios built a self-healing / health check method for this.

it could be done through CRON running at night, an admin setting, or even a check within Nagios itself, against itself. (the last wouldn't be the greatest as you leave the potential to auto-sql restart loops though.)

Post by **concise** » Thu Aug 03, 2017 6:47 am

I noticed it while running the command below:

mysqlcheck -f -r -u root -pnagiosxi --all-databases

It responds with the following:

nagiosxi.xi_incidents
note     : The storage engine for the table doesn't support repair

Let me check the 2 guides and get back to you.
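
A side note on that message: REPAIR TABLE is only supported by some storage engines (MyISAM, ARCHIVE and CSV), so a "doesn't support repair" note usually just means the table uses another engine such as InnoDB, and is probably not the cause of the frozen status by itself. The sketch below filters mysqlcheck output down to tables that reported an actual `error` line. The function name `mysqlcheck_errors` is hypothetical, and it assumes the usual mysqlcheck layout of a table name followed by `type : message` lines, as in the output quoted above.

```shell
# Read mysqlcheck output on stdin and print only the tables that reported
# an "error" line. "note", "warning", "status" and "info" lines are skipped.
mysqlcheck_errors() {
    awk '
        /^error *:/ { print table ": " $0; next }
        /^(note|warning|status|info) *:/ { next }
        NF { table = $1 }    # any other non-empty line starts with a table name
    '
}
```

It could be used as `mysqlcheck -f -r -u root -pnagiosxi --all-databases | mysqlcheck_errors`. If that prints nothing, the repair pass reported no errors.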

Nagios Support Forum