NagiosXI Zombie process troubles

ejmorrow
Posts: 13
Joined: Fri May 13, 2016 9:02 am

Post by ejmorrow »

Hello,

EDIT: Nagios is running on Red Hat Enterprise Linux 6.9 64-bit (manual install).

We are currently on NagiosXI v5.4.8. Over the weekend we started having some
issues. The Nagios checks, both active and passive, are very sporadic. Some of
the active checks haven't run in days.

In the web GUI, everything looks normal under "process info" and
"performance", except that "performance" shows fewer than a thousand checks
in 15 minutes instead of the usual few thousand checks every 5 minutes.

Nothing unusual appears in the logs, even though the nagios processes that
remain seem really busy. When I run strace on them, I get the following over
and over again. Different processes are doing the same thing but with different
hosts/services, which explains why they're busy and causing a high load on the system.

write(3, "job_id=209\0type=1\0command=/usr/bin/php
/usr/local/nagiosxi/scripts/handle_nagioscore_notification.php
--notification-type=service --contact=\"<omitted>\" --contactemail=\"<omitted>\"
--type=RECOVERY --escalated=\"0\" --author=\"\" --comments=\"\"
--host=\"<omitted>\" --hostaddress=\"<omitted>\"
--hostalias=\"esappj223.uits.iupui.edu\" --hostdisplayname=\"<omitted>\"
--service=\"Apache Busy Workers\" --hoststate=UP --hoststateid=0
--servicestate=OK --servicestateid=0
--lastservi"..., 1085) = -1 EAGAIN (Resource temporarily unavailable)

write(3, "job_id=209\0type=1\0command=/usr/bin/php
/usr/local/nagiosxi/scripts/handle_nagioscore_notification.php
--notification-type=service --contact=\"<omitted>\" --contactemail=\"<omitted>\"
--type=RECOVERY --escalated=\"0\" --author=\"\" --comments=\"\"
--host=\"<omitted>\" --hostaddress=\"<omitted>\"
--hostalias=\"<omitted>\" --hostdisplayname=\"<omitted>\"
--service=\"Apache Busy Workers\" --hoststate=UP --hoststateid=0
--servicestate=OK --servicestateid=0
--lastservi"..., 1085) = -1 EAGAIN (Resource temporarily unavailable)
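
For context, EAGAIN on write() means the descriptor is in non-blocking mode and the data could not be accepted at that moment, so the caller keeps retrying. A minimal sketch for counting how often this happens in captured strace output follows; the helper name count_eagain and the PID 12345 are made up for illustration, and the example assumes strace and the coreutils timeout command are available.

```shell
#!/bin/sh
# count_eagain: hypothetical helper, not part of Nagios XI.
# Reads strace output on stdin and prints how many lines report EAGAIN.
count_eagain() {
    # grep -c prints the count itself; "|| true" hides grep's non-zero
    # exit status when the count is 0.
    grep -c 'EAGAIN' || true
}

# Example (run as root; replace 12345 with the PID of a busy nagios process):
#   timeout 10 strace -f -e trace=write -p 12345 2>&1 | count_eagain
```

A count that keeps growing across repeated runs would suggest the same job is being retried without making progress, rather than a brief backlog.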

Thanks,

Eric
Last edited by ejmorrow on Thu Aug 17, 2017 11:35 am, edited 1 time in total.
bolson

Re: NagiosXI Zombie process troubles

Post by bolson »

Please execute the following from the command line on your XI server and post the result inside Code tags:

Code:

tail -n 200 /var/log/messages
ejmorrow

Re: NagiosXI Zombie process troubles

Post by ejmorrow »

I attached the log output to the original post.

Thanks,

Eric
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: NagiosXI Zombie process troubles

Post by tgriep »

Can you post the following files from the Nagios server?

Code:

/usr/local/nagios/etc/nagios.cfg
/etc/xinetd.d/nrpe

Run this as root and post the output.

Code:

ps -ef --cols=300
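
Since the topic is about zombie processes, here is a small sketch for pulling only the defunct entries out of a process listing. It assumes the procps column selection shown in the comment; the helper name list_zombies is hypothetical.

```shell
#!/bin/sh
# list_zombies: hypothetical helper, not part of Nagios XI.
# Expects "ps -eo pid,ppid,stat,args" style output on stdin and prints
# only the rows whose STAT column starts with Z (zombie/defunct).
list_zombies() {
    # NR > 1 skips the header row; $3 is the STAT column.
    awk 'NR > 1 && $3 ~ /^Z/'
}

# Example:
#   ps -eo pid,ppid,stat,args | list_zombies
```

The PPID column of each row shows which parent process has not yet reaped the zombie.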

I saw a PHP timeout error, which may be resolved by following the instructions in this KB article.
https://support.nagios.com/kb/article/n ... ables.html

As for the NRPE failure messages, you may want to look at the system that is trying to connect to the XI server and see what is causing them.
ejmorrow

Re: NagiosXI Zombie process troubles

Post by ejmorrow »

I already fixed the NRPE issue. The only_from statement was outside the service block.
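
For anyone hitting the same problem, this is a rough sketch of a check for a misplaced only_from line. It assumes the usual xinetd layout with the opening and closing braces on their own lines; the helper name check_only_from is made up for illustration.

```shell
#!/bin/sh
# check_only_from: hypothetical helper, not part of xinetd or NRPE.
# Usage: check_only_from FILE
# Prints "inside" if every only_from line sits within a { ... } block,
# "outside" if any only_from line appears outside one, "missing" if none.
check_only_from() {
    awk '
        /^[ \t]*#/ { next }                      # skip comment lines
        index($0, "{") { depth++ }               # entering a service block
        /^[ \t]*only_from[ \t]*[+-]?=/ {
            found = 1
            if (depth < 1) bad = 1               # not inside any block
        }
        index($0, "}") { depth-- }               # leaving a service block
        END {
            if (!found)   print "missing"
            else if (bad) print "outside"
            else          print "inside"
        }
    ' "$1"
}

# Example:
#   check_only_from /etc/xinetd.d/nrpe
```

xinetd has to be reloaded or restarted after the file is corrected for the change to take effect.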

Thanks,

Eric
bolson

Re: NagiosXI Zombie process troubles

Post by bolson »

May we close this topic?
ejmorrow

Re: NagiosXI Zombie process troubles

Post by ejmorrow »

No, because Nagios still isn't working right.

Thanks,

Eric
bolson

Re: NagiosXI Zombie process troubles

Post by bolson »

Please tell us how many hosts and service checks you are running. Is your NagiosXI server a VM, or is it running on physical hardware? What CPU and memory resources does the host have? Also, please run the following from the command line and attach top.txt:

top -an 1 -b > top.txt
ejmorrow

Re: NagiosXI Zombie process troubles

Post by ejmorrow »

We're at about 1,500 hosts and 22,000 service checks. 18,000 of those service checks are passive checks that report into NRDP.

Nagios is hosted on a VM with 4 CPUs and 64GB of memory. It has been up and running for a little over a year and a half, and it has been running for about 6 months with the current number of checks.

Thanks,

Eric
bolson

Re: NagiosXI Zombie process troubles

Post by bolson »

This could be a resource issue. Memory usage is fine, but CPU usage is high enough that spikes could be causing the PHP failures. I also suspect drive I/O issues. Could you please run the following commands and post the results:

Code:

dd if=/dev/zero of=./dd.out bs=1M count=4000
dd if=./dd.out of=/dev/null

You can then delete dd.out.
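
On a server with this much memory, a plain dd write may partly measure the page cache rather than the disk. A hedged variant is sketched below: it assumes GNU dd, whose conv=fdatasync option flushes the data to disk before the summary is printed. The helper name timed_write is hypothetical.

```shell
#!/bin/sh
# timed_write: hypothetical helper, not part of Nagios XI.
# Usage: timed_write DIR COUNT_MB
# Writes COUNT_MB mebibytes of zeros into DIR, asks dd to flush the data to
# disk before reporting (conv=fdatasync, a GNU dd option), prints dd's
# summary line, and removes the test file.
timed_write() {
    dir=$1
    count=$2
    file="$dir/dd.out"
    # LC_ALL=C keeps dd's summary in English; dd writes it to stderr.
    LC_ALL=C dd if=/dev/zero of="$file" bs=1M count="$count" conv=fdatasync 2>&1 | tail -n 1
    rm -f "$file"
}

# Example, matching the 4000 MiB size used above:
#   timed_write . 4000
```

Run it from a directory on the same filesystem that Nagios writes to, so the figure reflects the disk in question.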

Also, if you're not reserving CPU resources on your ESXi host, your performance could be impacted by other VMs running on it. For an installation as large as yours, we recommend running NagiosXI on physical hardware and using a ramdisk to improve I/O performance. If you haven't already done this, this document will guide you through the process.

https://assets.nagios.com/downloads/nag ... giosXI.pdf