NagiosXI Zombie process troubles

ejmorrow · Post by **ejmorrow** » Thu Aug 17, 2017 7:32 am

Hello,

**EDIT**: Nagios is running on Red Hat Enterprise Linux 6.9 64-bit (manual install)

We are currently on NagiosXI v5.4.8. Over the weekend we started having some
issues. The Nagios checks, both active and passive, are running very sporadically.
Some of the active checks haven't run in days.

In the web GUI everything looks normal under "Process Info" and
"Performance", except that Performance shows fewer than a thousand checks
in 15 minutes instead of the usual few thousand checks every 5 minutes.

Nothing unusual appears in the logs, even though the nagios processes that
remain seem really busy. When I run strace against them I get the following over
and over again. Different processes are doing the same thing, but for different
hosts/services, which explains why they're busy and causing high load on the system.

write(3, "job_id=209\0type=1\0command=/usr/bin/php
/usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-
type=service --contact=\"<ommited>\" --contactemail=\"<ommited>\" --
type=RECOVERY
--escalated=\"0\" --author=\"\" --comments=\"\" --
host=\"<ommited>\" --hostaddress=\"<ommited>\" --
hostalias=\"esappj223.uits.iupui.edu\" --
hostdisplayname=\"<ommited>\" --service=\"Apache Busy Workers\"
--hoststate=UP --hoststateid=0 --servicestate=OK --servicestateid=0 --
lastservi"..., 1085) = -1 EAGAIN (Resource temporarily unavailable)

write(3, "job_id=209\0type=1\0command=/usr/bin/php
/usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-
type=service --contact=\"<ommited>\" --contactemail=\"<ommited>\" --
type=RECOVERY
--escalated=\"0\" --author=\"\" --comments=\"\" --
host=\"<ommited>\" --hostaddress=\"<ommited>\" --
hostalias=\"<ommited>\" --
hostdisplayname=\"<ommited>\" --service=\"Apache Busy Workers\"
--hoststate=UP --hoststateid=0 --servicestate=OK --servicestateid=0 --
lastservi"..., 1085) = -1 EAGAIN (Resource temporarily unavailable)
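
A minimal sketch for putting a number on how often these writes fail, assuming the strace output has been saved to a file first. The file name `strace.out` and the helper `count_eagain_writes` are hypothetical and not part of Nagios. EAGAIN on a `write` means the descriptor is non-blocking and cannot accept more data right now, so a count that keeps growing suggests the reader on the other end is not draining it.

```shell
# Count strace lines that report a call failing with EAGAIN.
# Reads the file(s) given as arguments, or stdin when none are given.
count_eagain_writes() {
    grep -c -- '= -1 EAGAIN' "$@"
}
```

One possible way to use it: capture a sample with `strace -f -p <pid> -e trace=write -o strace.out`, then run `count_eagain_writes strace.out`.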

Thanks,

Eric

bolson · Post by **bolson** » Thu Aug 17, 2017 11:06 am

Please execute the following from the command line on your XI server and post the result inside Code tags:

tail -n 200 /var/log/messages

ejmorrow · Post by **ejmorrow** » Thu Aug 17, 2017 1:47 pm

I attached the log output to the original post.

Thanks,

Eric

tgriep · Post by **tgriep** » Thu Aug 17, 2017 4:41 pm

Can you post the following files from the Nagios server?

/usr/local/nagios/etc/nagios.cfg
/etc/xinetd.d/nrpe

Run this as root and post the output.

ps -ef --cols=300
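
If the goal of the process listing is to spot the zombie processes named in the thread title, the output can be narrowed down. This is only a sketch: the helper name `filter_zombies` is hypothetical, and it assumes the listing is produced with the process state in the first column, where a state beginning with `Z` marks a zombie (defunct) process.

```shell
# Keep the header line plus every process whose state starts with "Z".
# Expects input in the column order produced by: ps -eo stat,pid,ppid,args
filter_zombies() {
    awk 'NR == 1 || $1 ~ /^Z/'
}
```

For example: `ps -eo stat,pid,ppid,args | filter_zombies`. The PPID column shows which parent has not yet reaped each zombie.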

I saw a PHP timeout error and that error could be resolved by following the instructions in this KB article.
https://support.nagios.com/kb/article/n ... ables.html

As for the NRPE failure messages, you may want to look at the system that is trying to connect to the XI server and see what is causing them.

ejmorrow · Post by **ejmorrow** » Fri Aug 18, 2017 7:33 am

I already fixed the NRPE issue. The only_from statement was outside the service block.
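
For anyone hitting the same thing, here is a rough sketch of a check for this mistake. The helper name `check_only_from` is hypothetical, and it assumes the usual xinetd layout where attributes such as `only_from` sit between the `{` and `}` of a `service` entry. It reports any `only_from` line found outside the braces.

```shell
# Report only_from lines that appear outside a { ... } block in an
# xinetd service file. Returns 0 when none are found, 1 otherwise.
check_only_from() {
    awk '
        BEGIN { depth = 0; bad = 0 }
        {
            line = $0
            sub(/#.*/, "", line)    # drop comments before inspecting
            if (line ~ /^[ \t]*only_from/ && depth == 0) {
                printf "line %d: only_from outside service block\n", NR
                bad = 1
            }
            # track brace depth: gsub returns the number of matches
            depth += gsub(/[{]/, "{", line) - gsub(/[}]/, "}", line)
        }
        END { exit bad }
    ' "$1"
}
```

For example: `check_only_from /etc/xinetd.d/nrpe`. No output and exit status 0 mean every `only_from` line is inside a block.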

Thanks,

Eric

bolson · Post by **bolson** » Fri Aug 18, 2017 9:19 am

May we close this topic?

ejmorrow · Post by **ejmorrow** » Fri Aug 18, 2017 9:27 am

No, because Nagios still isn't working right.

Thanks,

Eric

bolson · Post by **bolson** » Fri Aug 18, 2017 9:42 am

Please tell us how many hosts and service checks you are running. Also, is your NagiosXI server a VM or is it running on physical hardware? What does the host have for CPU and memory resources? Please also run the following from the command line and attach top.txt:

top -an 1 -b > top.txt

ejmorrow · Post by **ejmorrow** » Fri Aug 18, 2017 9:57 am

We're at about 1500 hosts, and 22,000 service checks. 18,000 of those service checks are passive checks that report into NRDP.

Nagios is hosted on a VM server with 4 CPUs and 64GB of memory. It has been up and running for a little over a year and a half, and it's been running about 6 months with the current amount of checks.

Thanks,

Eric

bolson · Post by **bolson** » Fri Aug 18, 2017 10:28 am

This could be a resource issue... memory usage is fine. CPU is high enough that spikes could be causing the PHP failures. I also suspect drive IO issues. Could you please run the following commands and post the results:

dd if=/dev/zero of=./dd.out bs=1M count=4000
dd if=./dd.out of=/dev/null

You can then delete dd.out.

Also, if you're not reserving CPU resources on your ESXi host, your performance could be impacted by other VMs running on the same host. We recommend running NagiosXI on physical hardware for an installation as large as yours, and also using a RAM disk to improve IO performance. If you haven't already done this, this document will guide you through the process.

https://assets.nagios.com/downloads/nag ... giosXI.pdf

Nagios Support Forum
