CentOS Completely Bricking
Posted: Thu Nov 16, 2017 11:13 am
Hey Gents,
I am hoping you can help me with an issue. I have had Nagios XI running for about a year with zero issues concerning the actual operating system, until recently. For background, you guys have helped me in several earlier threads with my database problems; this may or may not be related.
I had not done any recent OS updates or Nagios XI updates. At the time of this incident, my setup at a glance was:
Nagios XI 5.4.8
CentOS 6.8
VMware
FYI, I have crontab rebooting the box every day at 2am (shutdown -r now).
No, I do not have anything monitoring the Nagios XI box. I realize best practice is to have a Nagios Core instance monitoring my Nagios XI.
ISSUE
Last Sunday, for the first time ever, the OS completely bricked. I couldn't ping it, or even KVM into it. Total brick. Since this had never happened before, I just killed the machine in VMware and started it back up. As a precaution, I cloned the machine and then ran system updates ("yum update") in hopes of a quick resolution. However, it happened again last night, which shakes my confidence in the box. I was at home, so I did a quick kill in VMware and booted it back up. It came up without any errors. However, the database had errors, so I ran the repair_database script. It completed within 2 minutes or so, but the box bricked about 30 seconds after the repair finished. I am having trouble figuring out what to look at to pinpoint the issue. To keep production running, I disconnected that box from the network and am using the clone as my primary. Theoretically, I could reproduce the issue if need be: run the script and brick it.
Edit: I did re-run the script on the old box with a snapshot in place. This time the script completes and the box doesn't brick. I can reach it by IP, but when I try to access the Nagios web interface it redirects me to the Nagios installer page, /nagiosxi/install.php.
Looking at /var/log/messages I hoped to find an issue or something pointing me in the right direction. However, it just stops logging....
I have attached a large piece of the log file, before and after. Here is a glimpse of what I am seeing:
Line 245 (the last thing logged): Nov 15 22:28:23 NAGIOS ndo2db: Trimming eventhandlers.
Line 246 (the next entry): Nov 15 22:47:21 NAGIOS kernel: imklog 5.8.10, log source = /proc/kmsg started.
I expected to see a failure of some sort. Instead, the last log entry is at 22:28, and the 22:47 entry is me booting it back up. The way I see it, there are 2 big variables: the OS and Nagios. Today, I reverted to a point prior to the OS updates described above and then updated Nagios XI from 5.4.8 to 5.4.11, which is where I am currently sitting, hoping it doesn't brick again.
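In case it helps anyone check their own logs for the same kind of silent stop, here is a rough Python sketch of how the gap could be located automatically. It is only an illustration: the names parse_ts and find_gaps and the 5 minute threshold are my own choices, it assumes the standard "Mon DD HH:MM:SS" syslog prefix, and since syslog stamps carry no year, it assumes 2017. The two sample lines are the ones quoted above.

```python
from datetime import datetime, timedelta


def parse_ts(line, year=2017):
    """Parse the leading 'Mon DD HH:MM:SS' syslog stamp; return None if absent.

    Syslog lines carry no year, so the year is an assumption passed in.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, min_gap=timedelta(minutes=5), year=2017):
    """Return (previous_line, next_line, gap) for each silence >= min_gap."""
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # skip continuation lines or anything without a stamp
        if prev_ts is not None and ts - prev_ts >= min_gap:
            gaps.append((prev_line.rstrip(), line.rstrip(), ts - prev_ts))
        prev_ts, prev_line = ts, line
    return gaps


if __name__ == "__main__":
    # The two entries quoted in this post; for a real run, pass an open
    # file handle for /var/log/messages instead of this list.
    sample = [
        "Nov 15 22:28:23 NAGIOS ndo2db: Trimming eventhandlers.",
        "Nov 15 22:47:21 NAGIOS kernel: imklog 5.8.10, log source = /proc/kmsg started.",
    ]
    for before, after, gap in find_gaps(sample):
        print(f"Silent for {gap}:")
        print(f"  last:  {before}")
        print(f"  first: {after}")
```

This only tells you where the logging went quiet and for how long. It does not say why, so it is a starting point for deciding which time window to dig into, nothing more.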
Attached is a large portion of /var/log/messages. Please let me know if there is anything else that would help. I still have console access to the old box and can pull anything off of it or test anything on it, and the same goes for the current Nagios XI. If you request any file or info, please specify whether you mean the "old" or the "current/new" Nagios box.