Service checks not working. How to determine what is hung

Posted: Mon Aug 12, 2019 3:31 pm
by awilson
Hi. The CPU load on one of our Nagios servers jumped from under 2 for the 1-, 5-, and 15-minute averages to:
Load Critical 18h 28m 49s 16/16 2019-08-12 13:38:30 Load Critical: load1=24.72, load5=24.71, load15=24.97.

iowait is at 0.
CPU Stats Ok 18h 4m 29s 1/16 2019-08-12 13:39:44 CPU STATISTICS OK : user=30.36% system=69.64% iowait=0.00% idle=0.00% nice=0.00% steal=0.00%

We missed a downed server in that period because the server availability ping check didn't execute.

This is an RHEL 6 VM with 12 CPUs and 32 GB of memory.
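
To put the load figures above in proportion to the 12 CPUs, here is a minimal sketch that divides a load average by the CPU count. The function name load_per_cpu is hypothetical and only for illustration; a sustained value well above 1.0 suggests more runnable work than the CPUs can serve.

```shell
#!/bin/sh
# Hypothetical helper: load_per_cpu LOAD CPUS
# Prints the load average divided by the CPU count, to two decimals.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Figures from this post: load1=24.72 on a 12-CPU VM.
load_per_cpu 24.72 12   # prints 2.06

# On a live Linux host (assumes /proc/loadavg and nproc are available):
# load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

By the same arithmetic, the earlier load of under 2 works out to less than 0.17 per CPU.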

While writing this, I checked mysqld.log and noticed hours of database error messages, so I ran the repairdatabases script. The CPU load is still high, and the long-running processes are still there. Should we kill them?

Thanks!

Re: Service checks not working. How to determine what is hun

Posted: Mon Aug 12, 2019 3:48 pm
by scottwilkerson
With 5575 tasks running on the server, something definitely isn't right.

Can you show the output of the following:

ps -ef|grep nagios.cfg

Re: Service checks not working. How to determine what is hun

Posted: Mon Aug 12, 2019 4:39 pm
by awilson
[nagios@server ~]$ ps -ef | grep nagios.cfg
nagios 718 1 0 Jun10 ? 00:01:11 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 929 718 0 Jun10 ? 00:04:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 9987 8241 0 16:39 pts/0 00:00:00 grep nagios.cfg
nagios 12286 1 0 Jun10 ? 00:00:23 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 12482 12286 0 Jun10 ? 00:04:18 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 20404 1 3 13:53 ? 00:05:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 20588 20404 0 13:53 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 32104 1 16 Aug09 ? 13:11:40 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 32283 32104 0 Aug09 ? 00:00:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
[nagios@server ~]$

Thx

Re: Service checks not working. How to determine what is hun

Posted: Mon Aug 12, 2019 4:45 pm
by scottwilkerson
OK, you have multiple Nagios parent processes, which is definitely a problem.

Let's run the following:

killall -9 nagios
service nagios start
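
The "multiple parent processes" condition can also be checked mechanically. The sketch below is not part of the Nagios tooling; it assumes ps output in "PID PPID ARGS" column order, and the function name count_nagios_parents is hypothetical. It counts Nagios daemons whose parent is init (PPID 1); in this thread a healthy state shows exactly one.

```shell
#!/bin/sh
# Hypothetical helper: reads "PID PPID ARGS..." lines on stdin and prints
# how many Nagios daemons have init (PPID 1) as their parent.
count_nagios_parents() {
    awk '$2 == 1 && /\/nagios -d / { n++ } END { print n + 0 }'
}

# On a live host (the ps header line is skipped because its PPID column is not 1):
# ps -eo pid,ppid,args | count_nagios_parents
```

Run against the first listing in this thread, this would count four parents (PIDs 718, 12286, 20404, and 32104); after the restart it would count one (PID 3254).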

Re: Service checks not working. How to determine what is hun

Posted: Mon Aug 12, 2019 4:58 pm
by awilson
We have this now

[nagios@server ~]$ ps -ef | grep nagios.cfg
nagios 3254 1 14 16:55 ? 00:00:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 3304 3254 0 16:55 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 4802 8241 0 16:55 pts/0 00:00:00 grep nagios.cfg

Load Critical: load1=5.08, load5=17.19, load15=22.51

The 1-minute load has improved quite a bit. The others should drop over time. Thanks.

What should we look for as a cause?

Thanks!

Re: Service checks not working. How to determine what is hun

Posted: Mon Aug 12, 2019 5:00 pm
by scottwilkerson
There are a host of possibilities, but once the database was corrupted, anything is possible.

If this is a VM, I would start looking at storage or a possible disk failure.

Re: Service checks not working. How to determine what is hun

Posted: Mon Aug 12, 2019 5:02 pm
by awilson
Thanks. There were storage incidents overlapping the period.

I'll check there.

You can close this. Thank you very much for the late afternoon help! :)

Re: Service checks not working. How to determine what is hun

Posted: Tue Aug 13, 2019 6:36 am
by scottwilkerson
Great!

Locking