Nagios Support Forum

Posted: **Thu Nov 16, 2017 11:13 am**

Hey Gents,

I am hoping you can help me with an issue. I have had Nagios XI running for about a year with zero issues concerning the actual operating system, until recently. To give you a rundown, you guys have helped me in several different discussions on my database problems; this may or may not be related.

Recently, I have not done any OS updates, or Nagios XI updates. At the time of this incident my setup at a glance is below:

Nagios XI 5.4.8
CentOS 6.8
VMware
FYI, I have crontab rebooting the box every day at 2am (shutdown -r now)
No, I do not have anything monitoring the Nagios XI box. I realize best practice is to have a Nagios Core instance monitoring my Nagios XI

ISSUE
Last Sunday, for the first time ever, the OS was completely bricked. I couldn't ping it, or even KVM into it. Total brick. This had never happened before, so I just killed the machine in VMware and started it back up. As a precaution, I cloned the machine and then ran system updates ("yum update") in hopes of a quick resolution. However, it happened again last night, which shakes my confidence in it. I was at home, so I did a quick kill in VMware, then booted it up. It came up without any errors. However, the database had errors, so I ran the repair_databases.sh script. It completed within 2 minutes or so but bricked about 30 seconds after completing the repair. I am having trouble finding what to look at to pinpoint the issue. For production purposes I disconnected that box from the network and used the clone as my primary. Theoretically, I could replicate the issue if need be: run the script and brick it.

Edit: I did re-run the script on the old box with a snapshot. The script does complete and the box doesn't brick. I can reach it by IP, but when I try to access the Nagios web interface it redirects me to the Nagios installer page, /nagiosxi/install.php.

Looking at /var/log/messages I hoped to find an issue or something pointing me in the right direction. However, it just stops logging....
I have attached a large piece of the log file, before and after. Here is a glimpse of what I am seeing:

At line 245 Nov 15 22:28:23 NAGIOS ndo2db: Trimming eventhandlers. is the last thing logged.
At line 246 Nov 15 22:47:21 NAGIOS kernel: imklog 5.8.10, log source = /proc/kmsg started.

I expected to see a failure of some sort. Instead, the last log entry is at 22:28, and the 22:47 entry is me booting it up. The way I see it, there are 2 large variables: the OS and Nagios. Today, I reverted to a point prior to the OS updates described above and then updated Nagios XI from 5.4.8 to 5.4.11, which is where I am currently sitting, hoping it doesn't brick again.
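A gap like the one between 22:28 and 22:47 can be located mechanically instead of by scrolling through the file. The following is only a rough sketch: `find_log_gaps` is a hypothetical helper name, and it assumes the classic `Mon DD HH:MM:SS` syslog timestamp and a log that stays within a single calendar month.

```shell
# Report consecutive syslog lines that are separated by at least a given
# number of seconds.
# Assumptions: "Mon DD HH:MM:SS host ..." timestamps, and all lines fall in
# the same month (the day-of-month is used directly, the month is ignored).
find_log_gaps() {
    # $1 = minimum gap in seconds, $2 = log file to scan
    awk -v min_gap="$1" '
    {
        split($3, t, ":")
        now = $2 * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1 && now - prev >= min_gap) {
            printf "gap of %d s before line %d\n", now - prev, NR
            print "  last: " prev_line
            print "  next: " $0
        }
        prev = now
        prev_line = $0
    }' "$2"
}
```

Something like `find_log_gaps 600 /var/log/messages` would then list every silence of ten minutes or more, together with the last line before it and the first line after it.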

Attached is the large portion of /var/log/messages. Please let me know if there is anything else that would help...I obviously have console access to the old box still and can get anything off of it, or test anything in it, or the current Nagios XI. If any file or info request please specify "old" or "current/new" Nagios box.

Posted: **Thu Nov 16, 2017 1:52 pm**

Hello ahoward12,
Is there a reason why you're rebooting your CentOS box nightly? That would be a very unusual practice. One of the high points of the Linux operating system is that there is seldom a reason to reboot the OS. There could be any number of causes for your issue... most of them directly resulting from your nightly reboot. The first thing that comes to mind is that your OS schedules a filesystem check every nth reboot. This is the default in CentOS. Post the output of:

cat /etc/fstab

And I can confirm.
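For context on what that file shows: the sixth field of each fstab line (fs_passno in fstab(5)) controls whether fsck is invoked for that filesystem at boot, and in which order. Whether a full check actually runs then depends on the filesystem's own mount-count and interval settings. A minimal sketch for listing the relevant entries, with `list_fsck_filesystems` as a hypothetical helper name:

```shell
# List fstab entries whose sixth field (fs_passno) is non-zero, i.e. the
# filesystems that fsck is asked to look at during boot, ordered by pass.
list_fsck_filesystems() {
    # $1 = fstab-format file (defaults to /etc/fstab)
    awk '!/^[[:space:]]*#/ && NF >= 6 && $6 > 0 { print $6, $2, $1 }' \
        "${1:-/etc/fstab}" | sort -n
}
```

Each output line is the pass number, the mount point, and the device, so the root filesystem (pass 1) should come first.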
Another possibility regarding your database issue is that there was database activity at the time of the reboot resulting in corruption. Run the repair script again redirecting the output to a file like this:

repair_databases.sh >> repair.txt

And attach repair.txt.
To be clear, best practice would be to only reboot your operating system when there is a bona fide reason to do so, e.g. when yum update installs a new kernel.

Posted: **Thu Nov 16, 2017 2:30 pm**

I reboot it nightly just to conserve resources. I have several other boxes that also reboot nightly, simply to "clear the chalkboard" of any items that are hung or possibly not working properly. It also keeps the box from chewing through memory. It has been doing this nightly reboot since nearly the first day it hit production, give or take 8 months. I find it hard to believe that this has never happened, then happens twice in one week. However, there is no point in asking for help if I am not willing to take it. **I will remove the crontab entry.**

Nevertheless, here is my fstab:

#
# /etc/fstab
# Created by anaconda on Fri Apr  5 10:41:28 2013
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/VolGroup-lv_root /                       ext4    defaults        1 1
UUID=8defe099-c9d2-4afa-8a51-183112e7be88 /boot                   ext4    defaults        1 2
/dev/mapper/VolGroup-lv_swap swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0

I understand your logic in asking for the output of the repair script, but the redirect does not capture it:

[root@NAGIOS scripts]# ./repair_databases.sh >> /var/log/repair_database_output.txt
PHP:  syntax error, unexpected '=' in /etc/php.d/mcrypt.ini on line 1
No log handling enabled - turning on stderr logging
Cannot find module (EtherLike-MIB): At line 0 in (none)

Piping the command will not work either
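One possible explanation, which nothing in this thread confirms, is that the messages shown above are written to standard error. `>>` on its own appends only standard output to the file, and a plain pipe likewise carries only standard output, so anything on stderr still lands on the terminal. A minimal sketch that captures both streams follows; `run_and_capture` is a hypothetical helper name.

```shell
# Run a command and append both stdout and stderr to a log file.
run_and_capture() {
    # $1 = log file, remaining arguments = the command to run
    log_file=$1
    shift
    # 2>&1 must come after the >> so stderr follows stdout into the file.
    "$@" >> "$log_file" 2>&1
}
```

Usage would look like `run_and_capture /var/log/repair_database_output.txt ./repair_databases.sh`. The pipe equivalent is `./repair_databases.sh 2>&1 | tee repair.txt`.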

Posted: **Thu Nov 16, 2017 5:48 pm**

Hello, @ahoward12. When you said:

It completed within 2 minutes or so but bricked about 30 seconds after completing the repair.

Did you mean that the screen went black again and you couldn't ping the server?

I'd like to get the whole XI profile. You can generate it manually using the script at /usr/local/nagiosxi/html/includes/components/profile/getprofile.sh. That should generate a profile in /usr/local/nagiosxi/var/components/, which you can get off the server with an application such as FileZilla. You can upload a zip file with your next post, or you could use a cloud service like Google Drive and share a link in a PM.

Also, I agree with @bolson. I think your problem is more likely caused by OS corruption as a result of an unexpected reboot when the system was in the middle of doing something. I haven't really heard of any problems caused by Nagios where the screen would go black and the whole server would become unreachable.

Posted: **Thu Nov 16, 2017 5:56 pm**

The first command below tells you when the filesystem was last checked. The second tells you how many reboots there are between automatic filesystem checks, if checking is enabled (-1 indicates checking is disabled).

tune2fs -l /dev/mapper/VolGroup-lv_root | grep Last\ c 
tune2fs -l /dev/mapper/VolGroup-lv_root | grep Mount

Where I'm going with this is that the box would be unreachable during an fsck check as the check occurs before the / mount takes place. In other words, pre-boot.
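To make the comparison concrete, the two mount-count fields from `tune2fs -l` can be read together. This is a rough sketch only: `fsck_due_report` is a hypothetical helper name, it looks only at the mount count (not the separate time-based check interval), and it assumes the usual `Field name:   value` layout of the tune2fs listing.

```shell
# Read `tune2fs -l` output on stdin and say how close the filesystem is to a
# mount-count-triggered fsck.
fsck_due_report() {
    awk -F': *' '
    /^Mount count/         { count = $2 }
    /^Maximum mount count/ { max = $2 }
    END {
        if (count == "" || max == "") {
            print "mount count fields not found"
            exit 1
        }
        if (max + 0 <= 0)
            print "mount-count checking disabled"
        else if (count + 0 >= max + 0)
            print "fsck due on next boot"
        else
            printf "%d mounts until next fsck\n", max - count
    }'
}
```

It would be fed with `tune2fs -l /dev/mapper/VolGroup-lv_root | fsck_due_report`.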

The reason why rebooting your host nightly is strongly advised against is that Nagios runs cron jobs which (obviously) can't run if the box is down. In addition, rebooting in this manner would abort any running processes, which could cause corruption. Also, I don't really understand the idea of "clearing the chalkboard" or "the box chewing through memory." When a Linux box is showing higher than expected memory usage with the free command, it usually is the result of caching... which is exactly what you want your box to do.
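The caching point can be checked directly from /proc/meminfo. The sketch below, with `mem_summary` as a hypothetical helper name, adds up the Buffers and Cached figures, which the kernel can mostly hand back when applications need memory, and shows them next to MemFree. It is a simplification: part of Cached (for example tmpfs data) is not reclaimable.

```shell
# Summarise how much memory is truly free versus held as buffers/page cache.
mem_summary() {
    # $1 = meminfo-format file (defaults to /proc/meminfo); values are in kB
    awk '
    $1 == "MemTotal:" { total = $2 }
    $1 == "MemFree:"  { free = $2 }
    $1 == "Buffers:"  { buffers = $2 }
    $1 == "Cached:"   { cached = $2 }
    END {
        cache = buffers + cached
        printf "total=%d kB free=%d kB cache=%d kB free+cache=%d kB\n",
               total, free, cache, free + cache
    }' "${1:-/proc/meminfo}"
}
```

If `free+cache` stays large while `free` looks small, the box is caching rather than leaking.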

Also, I'm perplexed at your inability to redirect the output of the repair script to a file. It works fine for me. I think that may be a red herring and it's more likely a failure of the script itself. Try running it again without the redirect.

As to why you could go months without an issue and then see it twice in a week, I'd use the 100-year floodplain analogy: you can live in a 100-year floodplain and get flooded out twice in the same year.

Posted: **Fri Nov 17, 2017 1:29 pm**

@npolovenko

Yes the screen was completely frozen and I couldn't do anything. I have uploaded the profile as well. I disabled the nightly reboot at both of your recommendations.

@bolson
Here is the output. The last check doesn't correlate with any of my problems.

[root@NAGIOS ~]# tune2fs -l /dev/mapper/VolGroup-lv_root | grep Last\ c
Last checked: Thu Oct 26 02:00:46 2017
[root@NAGIOS ~]# tune2fs -l /dev/mapper/VolGroup-lv_root | grep Mount
Mount count: 14
[root@NAGIOS ~]#

I understand the box would be unreachable during this time; nevertheless, I would see the fsck physically being done on the console...

As far as the inability to redirect the output, I am also perplexed. I just logged the SSH session and attached the output.

I understand your analogy. I am hoping that is the issue and I just got really unlucky, twice in a row. Is there no better log file than /var/log/messages that may have caught any of the bizarre activity?

Posted: **Fri Nov 17, 2017 2:55 pm**

I agree that it's an unusual practice, but I would have thought the init sequence would have taken care of the shutdown process. You aren't just cutting power to it. It's possible, of course, that there was some accidental or malicious change to your inits that would cause this. That said, a safer thing to do if you want to reboot it at night is to shut down the processes first:

# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop

Again, I agree that RAM numbers can be misleading due to caching, but it may be that there is some check that has a memory leak. I see that you are on CentOS 6.8 rather than 6.9. I haven't gone through the entire CentOS changelog, but you may want to think about updating to 6.9 if you are having memory concerns. The same logic applies to your plugins, although that's harder to achieve if you have a lot of third-party plugins.

One thing I notice is that you have 300 hosts, but only 200 services. Is that correct?

You've run the db repair script and the logs in the profile make it clear that didn't work. There are some other repair tools though. Please run through https://assets.nagios.com/downloads/nag ... tabase.pdf and report any errors (skipping the repair script).

Regardless of if anything gets fixed in the db, it's clear you have some other issues which the following commands should resolve.

Regarding the instructions below, if you do not have killall, you can install it via the following command:
# yum install psmisc

If psmisc is not in your repos, then instead you can check to make sure nagios is not running with
# ps -aef | grep nagios

Please note that you do not need to run the above ps command unless you don't have killall. It's one or the other. In either case, there should be no nagios processes. The service nagios stop command *should* stop all processes. However, the fact that you are contacting support suggests your system is in an unstable state, so the killall command is there just to make sure nothing weird is hanging around.
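One practical wrinkle with `ps -aef | grep nagios` is that the grep process itself shows up in the listing. A small sketch that counts exact command names instead follows; `count_processes_named` is a hypothetical helper name.

```shell
# Count lines on stdin that exactly match a command name.
# Intended input: one command name per line, e.g. from `ps -e -o comm=`.
count_processes_named() {
    # $1 = exact command name; "|| true" keeps the exit status at 0 even
    # when the count is 0 (grep itself exits 1 when nothing matches).
    grep -c -x -- "$1" || true
}
```

For example, `ps -e -o comm= | count_processes_named nagios` should print 0 once everything has been stopped.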

If the above document does not resolve your issue, please run the following commands in order and report any errors. You ***must*** use mariadb instead of mysqld in the commands below, ***if*** you have mariadb.
# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop
# killall -9 nagios
# killall -9 ndo2db
# rm -f /usr/local/nagios/var/rw/nagios.cmd
# rm -f /usr/local/nagios/var/nagios.lock
# rm -f /usr/local/nagios/var/ndo.sock
# rm -f /usr/local/nagios/var/ndo2db.lock
# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
# for i in `ipcs -q | grep nagios | awk '{print $2}'`; do ipcrm -q $i; done
# service mysqld start
# service ndo2db start
# service nagios start
# service httpd start
# service crond start
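The `ipcs`/`ipcrm` loop in the list above removes leftover kernel message queues. A slightly stricter sketch of the same idea is shown here: it matches the owner column rather than any line containing "nagios". `nagios_queue_ids` is a hypothetical helper name, and the column order (key, msqid, owner, ...) is assumed to be that of the Linux `ipcs -q` listing.

```shell
# Read `ipcs -q` output on stdin and print the msqid of every message queue
# whose owner column is "nagios". Header lines do not match and are skipped.
nagios_queue_ids() {
    awk '$3 == "nagios" { print $2 }'
}
```

The removal step would then be `ipcs -q | nagios_queue_ids | while read -r id; do ipcrm -q "$id"; done`, run as root after the services have been stopped.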

Should multiple kernel queues and nagios processes make your system unreachable? No. However, they are problems that should be addressed.

Posted: **Mon Nov 20, 2017 1:22 pm**

@dwhitfield I appreciate you looking into this. I attached the entire PuTTY session of me running the process in the link you posted, as well as the commands you had me run afterwards.

I guess we just wait to see how it runs?

Posted: **Mon Nov 20, 2017 3:11 pm**

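I don't think any of those ran after your man of killall, since you left in the #. The # is just the root prompt marking the start of each command; pasted in along with the command, it makes the shell treat the whole line as a comment.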

I've edited out the # so you can copy/paste more easily.

service nagios stop
service ndo2db stop
service mysqld stop
service crond stop
service httpd stop
killall -9 nagios
killall -9 ndo2db
rm -f /usr/local/nagios/var/rw/nagios.cmd
rm -f /usr/local/nagios/var/nagios.lock
rm -f /usr/local/nagios/var/ndo.sock
rm -f /usr/local/nagios/var/ndo2db.lock
rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
for i in `ipcs -q | grep nagios | awk '{print $2}'`; do ipcrm -q $i; done
service mysqld start
service ndo2db start
service nagios start
service httpd start
service crond start
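The same clean-up of pasted prompts can be done mechanically. A tiny sketch follows, with `strip_root_prompt` as a hypothetical helper name. Note that it would also strip the marker from genuine comment lines, so it is only suitable for prompt-prefixed command lists like the one earlier in this thread.

```shell
# Remove a leading "#" prompt marker (and the spaces after it) from each line
# on stdin, leaving lines without one untouched.
strip_root_prompt() {
    sed 's/^#[[:space:]]*//'
}
```

The result should be reviewed before being pasted into a root shell.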

Posted: **Mon Nov 20, 2017 4:29 pm**

Wow, that was just bad copy & pasting on my part. Embarrassing.

...

[root@NAGIOS ~]# killall -9 nagios
nagios: no process killed
[root@NAGIOS ~]# killall -9 ndo2db
ndo2db: no process killed
[root@NAGIOS ~]# rm -f /usr/local/nagios/var/rw/nagios.cmd
[root@NAGIOS ~]# rm -f /usr/local/nagios/var/nagios.lock
[root@NAGIOS ~]# rm -f /usr/local/nagios/var/ndo.sock
[root@NAGIOS ~]# rm -f /usr/local/nagios/var/ndo2db.lock
[root@NAGIOS ~]# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
[root@NAGIOS ~]# for i in `ipcs -q | grep nagios | awk '{print $2}'`; do ipcrm -q $i; done
[root@NAGIOS ~]# service mysqld start
Starting mysqld:                                           [  OK  ]
[root@NAGIOS ~]# service ndo2db start
Starting ndo2db: done.
[root@NAGIOS ~]# service nagios start
Starting nagios: done.
[root@NAGIOS ~]# service httpd start
Starting httpd:                                            [  OK  ]
[root@NAGIOS ~]# service crond start
Starting crond:                                            [  OK  ]
[root@NAGIOS ~]#

CentOS Completely Bricking
