Hi. One of our Nagios servers jumped from a CPU load below 2 (for the 1 min, 5 min, and 15 min averages) to:
Load Critical 18h 28m 49s 16/16 2019-08-12 13:38:30 Load Critical: load1=24.72, load5=24.71, load15=24.97.
iowait is at 0.
CPU Stats Ok 18h 4m 29s 1/16 2019-08-12 13:39:44 CPU STATISTICS OK : user=30.36% system=69.64% iowait=0.00% idle=0.00% nice=0.00% steal=0.00%
We missed a downed server in that period because the server availability ping check didn't execute.
This is an RHEL 6 VM with 12 CPUs and 32 GB of memory.
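For context, the load figures above can be read against the CPU count. The following is only a rough sketch: `load_per_cpu` is a hypothetical helper name, not part of Nagios, and load per CPU is a rule of thumb rather than a precise measure.

```shell
# Print the load average divided by the CPU count, to two decimals.
# Arguments: $1 = load average, $2 = number of CPUs.
# load_per_cpu is a hypothetical helper name used for illustration.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Figures from this post: load1=24.72 on 12 CPUs.
load_per_cpu 24.72 12

# Live usage on Linux (1-minute load and online CPU count):
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)"
```

A result well above 1 suggests more runnable tasks than CPUs, which fits idle=0.00% in the CPU stats.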
While writing this, I checked mysqld.log and noticed hours of database error messages, so I ran the repairdatabases script. The CPU load is still high, and the long-running processes are still there. Should we kill them?
Thanks!
Service checks not working. How to determine what is hung
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Service checks not working. How to determine what is hun
With 5575 tasks running on the server, something definitely isn't right.
Can you show the output of the following:
Code: Select all
ps -ef|grep nagios.cfg
Re: Service checks not working. How to determine what is hun
[nagios@server ~]$ ps -ef | grep nagios.cfg
nagios 718 1 0 Jun10 ? 00:01:11 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 929 718 0 Jun10 ? 00:04:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 9987 8241 0 16:39 pts/0 00:00:00 grep nagios.cfg
nagios 12286 1 0 Jun10 ? 00:00:23 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 12482 12286 0 Jun10 ? 00:04:18 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 20404 1 3 13:53 ? 00:05:48 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 20588 20404 0 13:53 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 32104 1 16 Aug09 ? 13:11:40 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 32283 32104 0 Aug09 ? 00:00:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
[nagios@server ~]$
Thx
-
scottwilkerson
Re: Service checks not working. How to determine what is hun
Ok, you have multiple nagios parent processes, which is definitely a problem.
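As a rough sketch of how to spot this condition in a listing like the one above: count the nagios daemons whose parent PID is 1. The function name `count_nagios_parents` is hypothetical, and the field positions assume the `ps -ef` column layout shown in this thread (UID, PID, PPID, C, STIME, TTY, TIME, CMD).

```shell
# Count Nagios daemons whose parent is init (PPID 1) in `ps -ef` output.
# Reads the listing from stdin so it can be run against saved output.
# count_nagios_parents is a hypothetical helper name, not a Nagios tool.
count_nagios_parents() {
    # $3 is the PPID column and $8 the command; the grep line has $8 = "grep".
    awk '$3 == 1 && $8 ~ /\/nagios$/ && /nagios\.cfg/ { n++ } END { print n + 0 }'
}

# Live usage:
#   ps -ef | count_nagios_parents
```

Run against the listing above, this would report 4 (PIDs 718, 12286, 20404, and 32104). A single parent is what one would expect to see after a clean restart.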
Let's run the following:
Code: Select all
killall -9 nagios
service nagios start
Re: Service checks not working. How to determine what is hun
We have this now
[nagios@server ~]$ ps -ef | grep nagios.cfg
nagios 3254 1 14 16:55 ? 00:00:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 3304 3254 0 16:55 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 4802 8241 0 16:55 pts/0 00:00:00 grep nagios.cfg
Load Critical: load1=5.08, load5=17.19, load15=22.51
The 1 minute has improved quite a bit. The others should drop over time. Thanks.
What should we look for as a cause?
Thanks!
-
scottwilkerson
Re: Service checks not working. How to determine what is hun
awilson wrote: The 1 minute has improved quite a bit. The others should drop over time. Thanks.
What should we look for as a cause?
Thanks!
There is a host of possibilities, but once the database was messed up, anything is possible.
If this is a VM, I would start looking at storage or a possible disk failure.
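One possible starting point for that storage check is to scan the kernel log for messages that often accompany disk or datastore trouble. This is only a sketch: `find_storage_errors` is a hypothetical helper name, and the pattern list is an assumption that is illustrative rather than exhaustive.

```shell
# Print kernel log lines that commonly accompany storage trouble.
# Reads log text from stdin; matching is case-insensitive.
# find_storage_errors and its pattern list are illustrative assumptions.
find_storage_errors() {
    grep -iE 'i/o error|blocked for more than|remounting filesystem read-only'
}

# Live usage:
#   dmesg | find_storage_errors
```

No output means none of these patterns matched. It does not rule out a storage problem on the hypervisor side.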
Re: Service checks not working. How to determine what is hun
Thanks. There were storage incidents overlapping the period.
I'll check there.
You can close this. Thank you very much for the late afternoon help! //smile
-
scottwilkerson
Re: Service checks not working. How to determine what is hun
awilson wrote: Thanks. There were storage incidents overlapping the period.
I'll check there.
You can close this. Thank you very much for the late afternoon help! //smile
Great!
Locking