Hey everyone,
Recently, our Nagios XI system has been becoming unresponsive at random. CPU usage gets pegged at 100%, we cannot even reach the web interface, Nagios stops monitoring all systems, and it no longer sends emails. When we reboot the Nagios server, the system comes back up and functions normally for several days before showing the same symptoms again.
What could be causing this?
Nagios Details:
CentOS 5 32Bit
Running on VMware vSphere 5.5
P.S. Linux newbie here!
Nagios XI hanging at 100% CPU usage
dwhitfield
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
Re: Nagios XI hanging at 100% CPU usage
First off, what's the output of the following? Please note that you will need to remove the "#" before each command line for the commands to work.
# df -i
# df -h
I suspect you are out of space. If you are, either resize the disk or clear out some space.
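If you'd rather not scan both listings by eye, the two checks above can be combined so that only filesystems near capacity are printed. This is a minimal sketch: `flag_full` is a hypothetical helper name, the 80% threshold is an arbitrary assumption, and mount points containing spaces would break the column parsing.

```shell
#!/bin/sh
# flag_full THRESHOLD: read `df -P` or `df -Pi` output on stdin and print the
# mount point and usage of every filesystem at or above THRESHOLD percent.
flag_full() {
  awk -v t="$1" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)
    # Some filesystems report "-" for inode usage; skip non-numeric values.
    if (use ~ /^[0-9]+$/ && use + 0 >= t) print $6 " " $5
  }'
}

# -P forces one line per filesystem; column 5 is Use% (or IUse% with -i).
echo "Block usage >= 80%:"
df -P | flag_full 80
echo "Inode usage >= 80%:"
df -Pi | flag_full 80
```

An empty list under both headings means neither disk space nor inodes are the bottleneck.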
Once any space issue is cleaned up, you are going to need to repair the db. Please run through https://assets.nagios.com/downloads/nag ... tabase.pdf and report any errors. If you stop at any point, please note where you stopped.
If the repair script and other instructions in the document do not work, please continue.
Regarding the instructions below, if you do not have killall, you can install it via the following command:
# yum install psmisc
If psmisc is not in your repos, you can instead confirm that nagios is not running with the following (the grep command itself will also show up in the output):
# ps -aef | grep nagios
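The killall fallback described above can be wrapped in a small helper. This is a sketch: `force_kill` is a hypothetical name, and the fallback assumes `pgrep` (from procps, normally present on CentOS) is installed when psmisc is not.

```shell
#!/bin/sh
# force_kill NAME: send SIGKILL to every process named exactly NAME.
# Prefers killall (psmisc); otherwise falls back to pgrep/kill (procps).
# Returns non-zero when no matching process was found.
force_kill() {
  if command -v killall >/dev/null 2>&1; then
    killall -9 "$1"
  else
    pids=$(pgrep -x "$1") || return 1
    kill -9 $pids   # unquoted on purpose: one argument per PID
  fi
}

# Afterwards, confirm nothing is left. The [n] bracket keeps grep from
# matching its own command line, so no output means nothing is running.
ps -aef | grep '[n]agios' || echo "no nagios processes running"
```

For example, `force_kill nagios` followed by `force_kill ndo2db` stands in for the two `killall -9` lines below.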
If that document does not resolve your issue, please run the following commands in order and report any errors. You ***must*** use mariadb instead of mysqld in the commands below, ***if*** you have mariadb.
# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop
# killall -9 nagios
# killall -9 ndo2db
# rm -rf /usr/local/nagios/var/rw/nagios.cmd
# rm -rf /usr/local/nagios/var/nagios.lock
# rm -f /usr/local/nagios/var/ndo.sock
# rm -f /usr/local/nagios/var/ndo2db.lock
# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
# for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
# service ndo2db start
# service nagios start
# service mysqld start
# service crond start
# service httpd start
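Since the point of running these in order is to report any errors, the same sequence can be scripted so that each failure is labelled instead of scrolling past. This is a sketch, not an official Nagios script: `run`, `reset_nagios`, and the `DB_SERVICE` variable are hypothetical names, and the paths and order are copied from the steps above.

```shell
#!/bin/sh
# Hypothetical wrapper for the reset sequence above; run it as root.
# DB_SERVICE is an assumption of this sketch: set it to "mariadb" if your
# system uses mariadb instead of mysqld.
DB_SERVICE=${DB_SERVICE:-mysqld}

# run: execute a command and, if it fails, print its exit status and the
# command line to stderr without stopping, so errors can be reported.
run() {
  "$@" || echo "ERROR ($?): $*" >&2
}

reset_nagios() {
  for svc in nagios ndo2db "$DB_SERVICE" crond httpd; do
    run service "$svc" stop
  done

  # killall exits non-zero when nothing matched, so an ERROR line here
  # usually just means the daemon had already stopped.
  run killall -9 nagios
  run killall -9 ndo2db

  # Stale command pipe, lock files, and socket left behind by the daemons.
  run rm -f /usr/local/nagios/var/rw/nagios.cmd \
            /usr/local/nagios/var/nagios.lock \
            /usr/local/nagios/var/ndo.sock \
            /usr/local/nagios/var/ndo2db.lock \
            /usr/local/nagiosxi/var/reconfigure_nagios.lock

  # Remove leftover System V message queues whose ipcs line mentions nagios
  # (column 2 of `ipcs -q` is the queue id).
  for id in $(ipcs -q | awk '/nagios/ {print $2}'); do
    run ipcrm -q "$id"
  done

  # Same start order as the manual steps.
  for svc in ndo2db nagios "$DB_SERVICE" crond httpd; do
    run service "$svc" start
  done
}
```

To use it, save it to a file (for example a hypothetical `reset.sh`), then as root run `. ./reset.sh && reset_nagios`, or `DB_SERVICE=mariadb` first on a mariadb system. Every `ERROR (...)` line printed is worth pasting into the thread.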
Assuming you can at least get to the web interface at this point, can you PM me your profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). The profile gives us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and zip it up again. We may end up needing something you removed, but we can ask for that specifically.
After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.
Re: Nagios XI hanging at 100% CPU usage
Thanks for the reply. /boot and /dev/sda1 are at most 34% used (that was the highest usage shown), so I did not free up any space. I went ahead and ran the DB repair, and it completed successfully.
Sorry, I did not see an option to PM you the profile (which I have downloaded and sanitized). How can I PM that to you?
Also, is there any way I can find the Nagios customer/account number? I am actually a Nagios customer but am unable to find the account number.
Thanks again!
dwhitfield
Re: Nagios XI hanging at 100% CPU usage
Did you also run through the following?
# service nagios stop
# service ndo2db stop
# service mysqld stop
# service crond stop
# service httpd stop
# killall -9 nagios
# killall -9 ndo2db
# rm -rf /usr/local/nagios/var/rw/nagios.cmd
# rm -rf /usr/local/nagios/var/nagios.lock
# rm -f /usr/local/nagios/var/ndo.sock
# rm -f /usr/local/nagios/var/ndo2db.lock
# rm -f /usr/local/nagiosxi/var/reconfigure_nagios.lock
# for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
# service ndo2db start
# service nagios start
# service mysqld start
# service crond start
# service httpd start
That said, the best thing to do at this point, if you are a customer, is to email [email protected]. If you get a bounce back, you'll need to email [email protected] to see what is up. If someone else set up the account, it's possible you were never added as an approved sender.
Re: Nagios XI hanging at 100% CPU usage
Thanks, I ran those commands and PM'ed you the system profile. I had to restart the Nagios server first, so I am not sure whether the problem is resolved; it can take up to a week to reappear. If you notice anything else that should be fixed, please let me know. Thanks!
- tacolover101
- Posts: 432
- Joined: Mon Apr 10, 2017 11:55 am
Re: Nagios XI hanging at 100% CPU usage
How many hosts and services do you have, and how much CPU and memory is allocated?
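On Linux, the CPU count and memory asked about here can be read directly from /proc. A minimal sketch; it reports what the guest OS sees, which should match the vCPU and RAM allocation in vSphere:

```shell
#!/bin/sh
# Count the "processor" entries in /proc/cpuinfo (one per online CPU) and
# convert MemTotal from kB to MB.
cpus=$(grep -c '^processor' /proc/cpuinfo)
mem_mb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024}' /proc/meminfo)
echo "CPUs: $cpus"
echo "Memory: ${mem_mb} MB"
```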
dwhitfield
Re: Nagios XI hanging at 100% CPU usage
Thanks @tacolover101! This is one of the questions that would have come through in the profile, had I gotten it. I sent a PM about it not coming through.
Re: Nagios XI hanging at 100% CPU usage
Thanks guys. I PMed the profile again.
We are monitoring 146 hosts and about 840 services total. The system has 4 vCPUs and 3 GB of memory. Does this meet the system requirements?
dwhitfield
Re: Nagios XI hanging at 100% CPU usage
That's definitely underpowered on the RAM side: https://assets.nagios.com/downloads/nag ... ements.pdf . Can you add more RAM and report back?
Thanks for the profile. I shared it with the other techs.
Can you post your /etc/my.cnf? It looks like you might need to increase your max connections.
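To see what max connections is currently set to before posting the whole file, the value can be pulled out of the config directly. This is a sketch under assumptions: `cnf_max_connections` is a hypothetical helper, it only looks at the `[mysqld]` section of the one file it is given, and it does not follow any `!include` directives.

```shell
#!/bin/sh
# cnf_max_connections FILE: print the max_connections value set under the
# [mysqld] section of a my.cnf-style file. Prints nothing if it is not set
# there, in which case the server default (or an included file) applies.
cnf_max_connections() {
  awk -F'=' '
    # Track whether we are inside the [mysqld] section.
    /^[ \t]*\[/ { in_mysqld = ($0 ~ /^[ \t]*\[mysqld\]/); next }
    # MySQL accepts dashes or underscores in option names.
    in_mysqld && $1 ~ /^[ \t]*max[-_]connections[ \t]*$/ {
      v = $2
      sub(/#.*/, "", v)        # drop a trailing comment
      gsub(/[ \t]/, "", v)     # trim whitespace
      print v
    }' "$1"
}

if [ -r /etc/my.cnf ]; then
  cnf_max_connections /etc/my.cnf
fi

# The value the running server actually uses can be checked with:
#   mysql -u root -p -e "SHOW VARIABLES LIKE 'max_connections';"
```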
Re: Nagios XI hanging at 100% CPU usage
Thanks. I bumped the memory up from 3 GB to 6 GB and will monitor for a week or so to see whether the issue recurs.
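If the hang does come back during that week, it helps to know which process was using the CPU just before it. One option is a periodic snapshot of the load average and the busiest processes. This is a sketch: `snapshot` and the default log path are hypothetical, and the `--sort` option assumes the procps version of `ps` that CentOS ships.

```shell
#!/bin/sh
# snapshot LOGFILE: append a timestamp, the 1/5/15-minute load averages from
# /proc/loadavg, and the five processes using the most CPU.
snapshot() {
  {
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') load: $(cut -d' ' -f1-3 /proc/loadavg)"
    ps -eo pid,pcpu,pmem,comm --sort=-pcpu 2>/dev/null | head -n 6
  } >> "$1"
}

snapshot "${1:-/tmp/cpu-snapshots.log}"
```

For example, a root crontab entry such as `*/5 * * * * /bin/sh /root/cpu-snapshot.sh /var/log/cpu-snapshots.log` (paths hypothetical) would record a snapshot every five minutes while crond is running.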
When I try to navigate to /etc/my.cnf, I get "Permission denied", even as root. Why is that?
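One likely explanation, assuming the path was typed at the prompt by itself: the shell then tries to *execute* /etc/my.cnf as a program, and because the file has no execute bit, that fails with "Permission denied" even for root. Reading the file only needs read permission. A minimal sketch; `show_file` is a hypothetical helper name:

```shell
#!/bin/sh
# show_file FILE: print the file's permissions, then its contents, so both
# can be pasted into the thread. Reading needs only read permission;
# running the path as a command would need the execute bit.
show_file() {
  ls -l "$1" && cat "$1"
}

if [ -r /etc/my.cnf ]; then
  show_file /etc/my.cnf
fi
```

Running `cat /etc/my.cnf` or `less /etc/my.cnf` directly works just as well.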