Service checks not working. How to determine what is hung

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
awilson
Posts: 224
Joined: Mon Mar 21, 2016 1:20 pm

Service checks not working. How to determine what is hung

Post by awilson »

Hi. One of our Nagios servers jumped from a CPU load below 2 (for the 1, 5, and 15 minute averages) to:
Load Critical 18h 28m 49s 16/16 2019-08-12 13:38:30 Load Critical: load1=24.72, load5=24.71, load15=24.97.

iowait is at 0.
CPU Stats Ok 18h 4m 29s 1/16 2019-08-12 13:39:44 CPU STATISTICS OK : user=30.36% system=69.64% iowait=0.00% idle=0.00% nice=0.00% steal=0.00%

We missed a downed server in that period because the server availability ping check didn't execute.

This is an RHEL 6 VM with 12 CPUs and 32 GB of memory.
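
For context on those numbers, a load average is usually read relative to the CPU count. A minimal sketch, assuming a POSIX shell with awk; `load_per_cpu` is a hypothetical helper name, not part of Nagios:

```shell
#!/bin/sh
# Divide a load average by the CPU count to get the load per CPU.
# Values well above 1.0 suggest more runnable tasks than CPUs.
load_per_cpu() {
    # $1 = load average, $2 = number of CPUs
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Figures from this post: load1=24.72 on a 12-CPU VM.
load_per_cpu 24.72 12
```

With the figures from this post it prints 2.06, roughly two runnable tasks per CPU. On a live host the inputs could come from /proc/loadavg and `getconf _NPROCESSORS_ONLN`.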

While writing this, I checked mysqld.log and noticed hours of DB error messages, so I ran the repairdatabases script. The CPU load is still high, and the long-running processes are still there. Should we kill them?

Thanks!
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: Service checks not working. How to determine what is hun

Post by scottwilkerson »

With 5575 tasks running on the server, something definitely isn't right.

Can you show the output of the following:

ps -ef|grep nagios.cfg
Former Nagios employee
awilson

Re: Service checks not working. How to determine what is hun

Post by awilson »

[nagios@server ~]$ ps -ef | grep nagios.cfg
nagios 718 1 0 Jun10 ? 00:01:11 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 929 718 0 Jun10 ? 00:04:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 9987 8241 0 16:39 pts/0 00:00:00 grep nagios.cfg
nagios 12286 1 0 Jun10 ? 00:00:23 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 12482 12286 0 Jun10 ? 00:04:18 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 20404 1 3 13:53 ? 00:05:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 20588 20404 0 13:53 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 32104 1 16 Aug09 ? 13:11:40 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 32283 32104 0 Aug09 ? 00:00:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
[nagios@server ~]$

Thx
scottwilkerson

Re: Service checks not working. How to determine what is hun

Post by scottwilkerson »

OK, you have multiple nagios parent processes, which is definitely a problem.

Let's run the following:

killall -9 nagios
service nagios start
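
The diagnosis above (several nagios daemons whose parent is PID 1) can be checked before and after a restart like this. A rough sketch, assuming `ps -ef` output where the third column is the PPID, and assuming a healthy instance shows exactly one such daemon, as in the listing posted after the restart; `count_nagios_parents` is a hypothetical helper name:

```shell
#!/bin/sh
# Count Nagios daemons whose parent is init (PPID 1).
# Reads `ps -ef` text on stdin; column 3 is the PPID.
count_nagios_parents() {
    awk '$3 == 1 && /nagios -d .*nagios\.cfg/ { n++ } END { print n + 0 }'
}
```

Usage would be `ps -ef | count_nagios_parents`. Fed the first listing in this thread it reports 4 (PIDs 718, 12286, 20404, and 32104); the `grep` line is ignored because its parent is the login shell, not PID 1.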
awilson

Re: Service checks not working. How to determine what is hun

Post by awilson »

We have this now

[nagios@server ~]$ ps -ef | grep nagios.cfg
nagios 3254 1 14 16:55 ? 00:00:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 3304 3254 0 16:55 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 4802 8241 0 16:55 pts/0 00:00:00 grep nagios.cfg

Load Critical: load1=5.08, load5=17.19, load15=22.51

The 1 minute has improved quite a bit. The others should drop over time. Thanks.
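
The reasoning here (a 1 minute average below the 5 minute, which is below the 15 minute, means load is coming down) can be sketched as a small check. This assumes a POSIX shell with awk; `load_trend` is a hypothetical helper name:

```shell
#!/bin/sh
# Classify the load trend from the 1, 5, and 15 minute averages.
load_trend() {
    # $1 = load1, $2 = load5, $3 = load15; "+ 0" forces numeric comparison
    awk -v l1="$1" -v l5="$2" -v l15="$3" 'BEGIN {
        if (l1 + 0 < l5 + 0 && l5 + 0 < l15 + 0)      print "falling"
        else if (l1 + 0 > l5 + 0 && l5 + 0 > l15 + 0) print "rising"
        else                                          print "mixed"
    }'
}

# Figures from this post, taken after the restart.
load_trend 5.08 17.19 22.51
```

For the values above it prints "falling". The three inputs are the first three fields of /proc/loadavg on Linux.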

What should we look for as a cause?

Thanks!
scottwilkerson

Re: Service checks not working. How to determine what is hun

Post by scottwilkerson »

awilson wrote: The 1 minute has improved quite a bit. The others should drop over time. Thanks.

What should we look for as a cause?

Thanks!

There is a host of possibilities, but once the database was messed up, anything is possible.

If this is a VM, I would start looking at storage or a possible disk failure.
awilson

Re: Service checks not working. How to determine what is hun

Post by awilson »

Thanks. There were storage incidents overlapping the period.

I'll check there.

You can close this. Thank you very much for the late afternoon help! :)
scottwilkerson

Re: Service checks not working. How to determine what is hun

Post by scottwilkerson »

awilson wrote: Thanks. There were storage incidents overlapping the period.

I'll check there.

You can close this. Thank you very much for the late afternoon help! :)

Great!

Locking