NagiosXI Zombie process troubles
Hello,
EDIT: Nagios is running on Red Hat Enterprise Linux 6.9 64-bit (manual install).
We are currently on NagiosXI v5.4.8. Over the weekend we started having some
issues. The Nagios checks, both active and passive, are very sporadic. Some of the active
checks haven't gone through in days.
In the web GUI everything looks normal under "process info" and
"performance", except that instead of showing a few thousand
checks every 5 minutes, performance shows fewer than a thousand in
15 minutes.
Nothing unusual appears in the logs, even though the remaining nagios processes
seem really busy. When I run strace on them I get the following over and
over again. Different processes are doing the same thing, but with different
hosts/services, which explains why they're busy and causing a high load on the system.
write(3, "job_id=209\0type=1\0command=/usr/bin/php
/usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-
type=service --contact=\"<ommited>\" --contactemail=\"<ommited>\" --
type=RECOVERY
--escalated=\"0\" --author=\"\" --comments=\"\" --
host=\"<ommited>\" --hostaddress=\"<ommited>\" --
hostalias=\"esappj223.uits.iupui.edu\" --
hostdisplayname=\"<ommited>\" --service=\"Apache Busy Workers\"
--hoststate=UP --hoststateid=0 --servicestate=OK --servicestateid=0 --
lastservi"..., 1085) = -1 EAGAIN (Resource temporarily unavailable)
write(3, "job_id=209\0type=1\0command=/usr/bin/php
/usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-
type=service --contact=\"<ommited>\" --contactemail=\"<ommited>\" --
type=RECOVERY
--escalated=\"0\" --author=\"\" --comments=\"\" --
host=\"<ommited>\" --hostaddress=\"<ommited>\" --
hostalias=\"<ommited>\" --
hostdisplayname=\"<ommited>\" --service=\"Apache Busy Workers\"
--hoststate=UP --hoststateid=0 --servicestate=OK --servicestateid=0 --
lastservi"..., 1085) = -1 EAGAIN (Resource temporarily unavailable)
Thanks,
Eric
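The repeated EAGAIN writes in the strace output above can be tallied per process to see which ones are stuck retrying. This is only a minimal sketch, assuming the strace output was saved to a file with one syscall per line and a leading PID (the layout `strace -f -o FILE` produces). The helper name `count_eagain` and the file path in the usage comment are hypothetical.

```shell
# Count write() calls that failed with EAGAIN, grouped by PID.
# Assumes one syscall per line, prefixed with the PID, as written
# by `strace -f -o FILE`. count_eagain is a hypothetical helper.
count_eagain() {
    awk '$2 ~ /^write\(/ && /= -1 EAGAIN/ { n[$1]++ }
         END { for (pid in n) print pid, n[pid] }' "$1" | sort -n
}

# Example capture (run as root, replace <pid> with a busy nagios PID):
#   strace -f -e trace=write -o /tmp/nagios.strace -p <pid>
#   count_eagain /tmp/nagios.strace
```

A process with a large and growing count is spinning on a non-blocking write whose receiving end is not draining the data.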
Last edited by ejmorrow on Thu Aug 17, 2017 11:35 am, edited 1 time in total.
-
bolson
Re: NagiosXI Zombie process troubles
Please execute the following from the command line on your XI server and post the result inside Code tags:
tail -n 200 /var/log/messages
Re: NagiosXI Zombie process troubles
I attached the log output to the original post.
Thanks,
Eric
Re: NagiosXI Zombie process troubles
Can you post the following files from the Nagios server?
/usr/local/nagios/etc/nagios.cfg
/etc/xinetd.d/nrpe
Run this as root and post the output.
ps -ef --cols=300
I saw a PHP timeout error, which could be resolved by following the instructions in this KB article.
https://support.nagios.com/kb/article/n ... ables.html
As for the NRPE fail messages, you may want to look at the system trying to connect to the XI server and see what is causing that.
Re: NagiosXI Zombie process troubles
I already fixed the NRPE issue. The only_from statement was outside the service block.
Thanks,
Eric
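For anyone hitting the same NRPE symptom: in an xinetd service file, `only_from` applies to a service only when it sits between the braces of that `service` block. Below is a minimal sketch of a placement check, assuming the file is the `/etc/xinetd.d/nrpe` mentioned earlier in the thread. `check_only_from` is a hypothetical helper that tracks brace depth only and does not validate the rest of the xinetd syntax.

```shell
# Report any only_from line that sits outside a { ... } block
# in an xinetd service file. Hypothetical helper: it tracks brace
# depth only and does not parse the full xinetd syntax.
check_only_from() {
    awk '
        /^[[:space:]]*#/ { next }              # skip comment lines
        /\{/ { depth++ }
        /^[[:space:]]*only_from/ && depth < 1 {
            printf "line %d: only_from outside service block\n", NR
            bad = 1
        }
        /\}/ { depth-- }
        END { exit bad + 0 }                   # non-zero exit if misplaced
    ' "$1"
}

# Usage:
#   check_only_from /etc/xinetd.d/nrpe && echo "only_from placement looks fine"
```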
Re: NagiosXI Zombie process troubles
No, because Nagios still isn't working right.
Thanks,
Eric
-
bolson
Re: NagiosXI Zombie process troubles
Please tell us how many hosts and service checks you are running. Also, is your NagiosXI server a VM or is it running on physical hardware? What does the host have for CPU and memory resources? Also, please run the following from the command line and attach top.txt:
top -an 1 -b > top.txt
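Since the thread title mentions zombie processes, a tally of them by parent PID may be useful alongside the `top` snapshot. This is a minimal sketch, assuming a procps `ps` that accepts `-eo stat=,ppid=,comm=`. `zombies_by_parent` is a hypothetical name, not a Nagios tool.

```shell
# Summarize zombie (state Z) processes by parent PID.
# Reads "STAT PPID COMMAND" rows on stdin, so it works on live
# ps output or on a saved file. Hypothetical helper.
zombies_by_parent() {
    awk '$1 ~ /^Z/ { n[$2]++ }
         END { for (p in n) print p, n[p] }' | sort -n
}

# Usage on a live system:
#   ps -eo stat=,ppid=,comm= | zombies_by_parent
```

Each output row is a parent PID followed by the number of zombie children it has not yet reaped.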
Re: NagiosXI Zombie process troubles
We're at about 1500 hosts, and 22,000 service checks. 18,000 of those service checks are passive checks that report into NRDP.
Nagios is hosted on a VM server with 4 CPUs and 64GB of memory. It has been up and running for a little over a year and a half, and it's been running about 6 months with the current amount of checks.
Thanks,
Eric
-
bolson
Re: NagiosXI Zombie process troubles
This could be a resource issue. Memory usage is fine, but CPU is high enough that spikes could be causing the PHP failures. I also suspect drive I/O issues. Could you please run the following commands and post the results:
dd if=/dev/zero of=./dd.out bs=1M count=4000
dd if=./dd.out of=/dev/null
You can then delete dd.out.
Also, if you're not reserving CPU resources on your ESXi host, your performance could be impacted by other VMs running on that host. We recommend running NagiosXI on physical hardware for an installation as large as yours, and also using a ramdisk to improve I/O performance. If you haven't already done this, this document will guide you through the process.
https://assets.nagios.com/downloads/nag ... giosXI.pdf
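On a server with a lot of free memory, the plain `dd` write above may partly measure the page cache rather than the disk. A hedged variant is sketched below, assuming GNU dd (for `conv=fsync` and `bs=1M`). `disk_write_test` is a hypothetical helper, and the directory and size arguments are illustrative.

```shell
# Write COUNT MiB of zeros into DIR, flush to disk before dd prints
# its timing summary, then remove the test file.
# Assumes GNU dd; disk_write_test is a hypothetical helper.
disk_write_test() (
    dir=${1:-.}
    count=${2:-4000}
    out="$dir/dd.out.$$"
    # conv=fsync makes dd flush the file to disk before it reports throughput
    dd if=/dev/zero of="$out" bs=1M count="$count" conv=fsync
    status=$?
    rm -f "$out"
    exit "$status"
)

# Usage (writes 4000 MiB in the current directory, then cleans up):
#   disk_write_test . 4000
```

The function body runs in a subshell, so its variables do not leak into the calling shell, and the test file is removed even when `dd` fails.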