Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 10:35 am
by bolson
In addition, you could run load, memory, and IO checks on the NagiosXI server itself to see if the php failures correspond with resource issues on your server.
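For example, a minimal sketch of such spot-checks (generic Linux commands; iostat requires the sysstat package):

```shell
# Quick spot-checks for resource pressure on the XI host.
# A minimal sketch, not an exhaustive health check.
cat /proc/loadavg                # 1/5/15-minute load averages
free -m                          # memory and swap usage in MB
vmstat 1 5                       # CPU, run queue, and swap activity over 5 seconds
iostat -x 1 5                    # per-device IO utilization (needs sysstat installed)
ps aux --sort=-%mem | head -n 5  # top memory consumers
```

If the php failures line up with spikes in load, swap activity, or IO wait in these outputs, that points at a resource problem rather than a Nagios bug.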

Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 12:48 pm
by ejmorrow
We reboot our servers on Sunday. When we rebooted this Sunday, it looks like there was a large memory leak or something else that exhausted the memory. The server ended up becoming unresponsive, so we rebooted it. Afterwards there were some complaints about MySQL tables, so we ran /usr/local/nagiosxi/scripts/repair_databases.sh, which seemed to clear up those issues, but then we started running into the issues we're currently having.

I noticed that even with Nagios shut down, the postmaster service remains busy. Looking into it, it appears to be cleaning up data. Specifically, it's running the following non-stop, and the IDs are in the millions:

DELETE FROM xi_meta WHERE meta_id = xxxxxxxx

Would it be safe to drop the xi_meta table and run repair_databases.sh again? I'm not seeing an end in sight for this.

Thanks,

Eric

Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 1:07 pm
by scottwilkerson
ejmorrow wrote:We reboot our servers on Sunday. When we rebooted this Sunday, it looks like there was a large memory leak or something else that exhausted the memory. The server ended up becoming unresponsive, so we rebooted it. Afterwards there were some complaints about MySQL tables, so we ran /usr/local/nagiosxi/scripts/repair_databases.sh, which seemed to clear up those issues, but then we started running into the issues we're currently having.

I noticed that even with Nagios shut down, the postmaster service remains busy. Looking into it, it appears to be cleaning up data. Specifically, it's running the following non-stop, and the IDs are in the millions:

DELETE FROM xi_meta WHERE meta_id = xxxxxxxx

Would it be safe to drop the xi_meta table and run repair_databases.sh again? I'm not seeing an end in sight for this.

Thanks,

Eric
The offending table would be xi_events; however, you shouldn't drop the table. You could truncate it without causing harm.

Additionally, it is worth mentioning that this would not clear up messages already sent to the mail spool.

Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 1:16 pm
by tgriep
The zombie processes could also be caused by the external_command_buffer_slots option being set very high in the nagios.cfg file.
Try changing it from

Code: Select all

external_command_buffer_slots=2048
to

Code: Select all

external_command_buffer_slots=512
and see if that cuts down on the defunct processes.
You will have to restart the nagios process for the change to take effect.
If you still see defunct processes, remove that option altogether.
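If it helps, here is a hedged sketch of making that change from the shell. The nagios.cfg path is assumed to be the standard Nagios XI location; back up the file first:

```shell
# Back up nagios.cfg, lower external_command_buffer_slots, and restart.
# The path below is the usual XI location -- adjust if yours differs.
CFG=/usr/local/nagios/etc/nagios.cfg
cp "$CFG" "$CFG.bak"
sed -i 's/^external_command_buffer_slots=.*/external_command_buffer_slots=512/' "$CFG"
service nagios restart
```

To remove the option altogether instead, a `sed -i '/^external_command_buffer_slots=/d' "$CFG"` would delete the line, after which Nagios falls back to its default.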

Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 1:29 pm
by ejmorrow
It looks like it finally cleared up on its own after leaving Nagios off for a bit. I ran the dd command, and it came back with 250 MB/s.

I think the first post I made is the best place to really look at what is going wrong.

write(3, "job_id=213\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"<ommited>\" --contactemail=\"<ommited>\" --type=RECOVERY --escalated=\"0\" --author=\"\" --comments=\"\" --host=\"<ommited>\" --hostaddress=\"<ommited>\" --hostalias=\"<ommited\" --hostdisplayname=\"<ommited>\" --service=\"DiskIO\" --hoststate=UP --hoststateid=0 --servicestate=OK --servicestateid=0 --lastservicestate=CRITICAL -"..., 970) = -1 EAGAIN (Resource temporarily unavailable)

There are over one hundred thousand of these a second, and this is what is driving up the load on the server.

Eric

Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 1:35 pm
by ejmorrow
I set external_command_buffer_slots to 512, no difference.

Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 1:56 pm
by dwhitfield
Did you truncate the table as suggested?

Can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.

Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 2:44 pm
by ejmorrow
I PM'd you the profile. I did not end up truncating the table because it eventually stopped deleting entries.

Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 3:34 pm
by tgriep
Thanks for the profile.
One thing: the /usr/local partition is almost full, and it should be cleaned up or increased in size soon.
Here are a couple of commands you can run to find out where the space has gone on that partition.

Find the largest 10 files by size command:

Code: Select all

find /usr/local -type f -print0 | xargs -0 du | sort -n | tail -10 | cut -f2 | xargs -I{} du -sh {}
Find the highest inode count.

Code: Select all

for i in /usr/local/*; do echo $i; find $i |wc -l; done
Next, remove this option from the nagios.cfg file

Code: Select all

external_command_buffer_slots=2048
Then run this to truncate the postgres tables to be sure they are clean.

Code: Select all

service nagios stop
service ndo2db stop
service crond stop
service postgresql restart
pkill -9 -u nagios
echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | psql nagiosxi nagiosxi
service crond start
service ndo2db start
service nagios start
service npcd restart
Let us know how it works out.

Re: NagiosXI Zombie process troubles

Posted: Fri Aug 18, 2017 4:09 pm
by ejmorrow
/usr/local is filling up because of /usr/local/nagiosxi/var/sysstat.log, which contains a bunch of lines like the ones below:

chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/cLS6KBo': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/cArUDUo': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/c6GCwpk.ok': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/c6GCwpk': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/clav5GR.ok': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/clav5GR': Operation not permitted

I tried what you recommended and it made no difference.

Eric
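For anyone hitting the same sysstat.log growth, a hedged cleanup sketch (paths taken from the posts above; truncating in place avoids deleting the file out from under whatever process still holds it open):

```shell
# Reclaim the space without removing the file itself:
: > /usr/local/nagiosxi/var/sysstat.log

# Then inspect the ownership and permissions that trigger the chown errors:
ls -ld /var/nagiosramdisk/spool/checkresults
ls -l /var/nagiosramdisk/spool/checkresults | head
```

The truncation only buys space back; the underlying fix is whatever is causing the "Operation not permitted" chown failures on the ramdisk spool, which is usually an ownership or mount-option problem on /var/nagiosramdisk.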