NagiosXI Zombie process troubles
-
bolson
Re: NagiosXI Zombie process troubles
In addition, you could run load, memory, and IO checks on the NagiosXI server itself to see if the PHP failures correspond with resource issues on your server.
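For example, the standard Linux tools give a quick read on all three (a sketch, not specific commands from this thread; `iostat` needs the sysstat package installed, so `vmstat` from procps is used here as a commonly preinstalled alternative):

```shell
uptime       # 1/5/15-minute load averages
free -m      # memory and swap usage in MiB
vmstat 1 3   # CPU, memory, and IO activity, three one-second samples
```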
Re: NagiosXI Zombie process troubles
We reboot our servers on Sunday. This Sunday when we rebooted, it looks like there was a large memory leak or something else that exhausted the memory. The server ended up becoming unresponsive, so we rebooted it. Afterwards there were some complaints about MySQL tables, so we ran /usr/local/nagiosxi/scripts/repair_databases.sh, which seemed to clear up those issues, but then we started running into the issues we're currently having.
I noticed that even with Nagios shutdown the postmaster service remains busy. Looking into it, it looks like it's cleaning up data. Specifically it's doing this non-stop, and the id's are in the millions.
DELETE FROM xi_meta WHERE meta_id = xxxxxxxx
Would it be safe to drop the xi_meta table and run repair_databases.sh again? Because I'm not seeing an end in sight for this.
Thanks,
Eric
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: NagiosXI Zombie process troubles
The offending table would be xi_events. However, you shouldn't drop the table, but you could truncate it without causing harm.
Additionally, it is worth mentioning that this would not clear up messages already sent to the mail spool.
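A truncate of just that table can be issued through psql. A sketch, assuming the XI defaults of a database and user both named nagiosxi (as used later in this thread), with the services that write to it stopped first:

```shell
service nagios stop
service ndo2db stop
echo "TRUNCATE TABLE xi_events;" | psql nagiosxi nagiosxi
service ndo2db start
service nagios start
```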
Re: NagiosXI Zombie process troubles
The zombie processes could also be caused by the external_command_buffer_slots option being set very high in the nagios.cfg file.
Try changing it from

Code: Select all

external_command_buffer_slots=2048

to

Code: Select all

external_command_buffer_slots=512

and see if that cuts down on the defunct processes.
You will have to restart the nagios process for the change to take effect.
If you still see defunct processes, remove that option altogether.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: NagiosXI Zombie process troubles
It looks like it finally cleared up on its own after leaving Nagios off for a bit. I ran the dd command, and it came back with 250 MB/s.
I think the first post I made is the best place to really look at what is going wrong.
write(3, "job_id=213\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"<ommited>\" --contactemail=\"<ommited>\" --type=RECOVERY --escalated=\"0\" --author=\"\" --comments=\"\" --host=\"<ommited>\" --hostaddress=\"<ommited>\" --hostalias=\"<ommited\" --hostdisplayname=\"<ommited>\" --service=\"DiskIO\" --hoststate=UP --hoststateid=0 --servicestate=OK --servicestateid=0 --lastservicestate=CRITICAL -"..., 970) = -1 EAGAIN (Resource temporarily unavailable)
There are over one hundred thousand of these a second, and it is what is driving up the load on the server.
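The EAGAIN in that strace line means a write to a non-blocking file descriptor (here, the pipe feeding the notification worker) would block because the kernel buffer is full and nothing is draining it. A minimal, self-contained illustration of the same errno using an ordinary pipe, not Nagios code:

```python
import errno
import fcntl
import os

# Create a pipe and put its write end in non-blocking mode,
# mimicking a writer whose reader has stalled.
r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

# Fill the pipe's kernel buffer without anyone reading from it.
try:
    while True:
        os.write(w, b"x" * 65536)
except BlockingIOError as e:
    # Same failure the strace line shows: buffer full, would block.
    print(e.errno == errno.EAGAIN)  # True
```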
Eric
Re: NagiosXI Zombie process troubles
I set external_command_buffer_slots to 512, no difference.
-
dwhitfield
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
- Contact:
Re: NagiosXI Zombie process troubles
Did you truncate the table as suggested?
Can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.
After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.
Re: NagiosXI Zombie process troubles
I PM'd you the profile. I did not end up truncating the table because it eventually stopped deleting entries.
Re: NagiosXI Zombie process troubles
Thanks for the profile.
One thing: the /usr/local partition is almost full, and it should be cleaned up or increased in size soon.
Here are a couple of commands you can run to find out where the space has gone on that partition.

Find the largest 10 files by size:

Code: Select all

find /usr/local -type f -print0 | xargs -0 du | sort -n | tail -10 | cut -f2 | xargs -I{} du -sh {}

Find the highest inode count:

Code: Select all

for i in /usr/local/*; do echo $i; find $i | wc -l; done

Next, remove this option from the nagios.cfg file:

Code: Select all

external_command_buffer_slots=2048

Then run this to truncate the postgres tables to be sure they are clean:

Code: Select all

service nagios stop
service ndo2db stop
service crond stop
service postgresql restart
pkill -9 -u nagios
echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | psql nagiosxi nagiosxi
service crond start
service ndo2db start
service nagios start
service npcd restart

Let us know how it works out.
Re: NagiosXI Zombie process troubles
The /usr/local filling up is because of /usr/local/nagiosxi/var/sysstat.log, with a bunch of lines like the ones below:
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/cLS6KBo': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/cArUDUo': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/c6GCwpk.ok': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/c6GCwpk': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/clav5GR.ok': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/clav5GR': Operation not permitted
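A log that has grown like this can be truncated in place rather than deleted (deleting would leave the writing process holding the old inode, so the space wouldn't be freed). A minimal sketch, shown against a scratch file rather than the live sysstat.log path:

```shell
# Stand-in for /usr/local/nagiosxi/var/sysstat.log
LOG=$(mktemp)
head -c 1048576 /dev/zero > "$LOG"   # simulate 1 MiB of accumulated log lines
: > "$LOG"                           # truncate in place; open file handles survive
wc -c < "$LOG"                       # prints 0
rm -f "$LOG"
```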
I tried what you recommended and it made no difference.
Eric