NagiosXI Zombie process troubles
-
bolson
Re: NagiosXI Zombie process troubles
In addition, you could run load, memory, and IO checks on the NagiosXI server itself to see if the PHP failures correspond with resource issues on your server.
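For example, the standard Linux tools give a quick read on all three (a sketch, not specific commands from this thread; `iostat` needs the sysstat package installed, so `vmstat` from procps is used here as a commonly preinstalled alternative):

```shell
uptime       # 1/5/15-minute load averages
free -m      # memory and swap usage in MiB
vmstat 1 3   # CPU, memory, and IO activity, three one-second samples
```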
Re: NagiosXI Zombie process troubles
We reboot our servers on Sunday. This Sunday when we rebooted, it looks like there was a large memory leak or something else that exhausted the memory. The server ended up becoming unresponsive, so we rebooted it. Afterwards there were some complaints about MySQL tables, so we ran /usr/local/nagiosxi/scripts/repair_databases.sh, which seemed to clear up those issues, but then we started running into the issues we're currently having.
I noticed that even with Nagios shutdown the postmaster service remains busy. Looking into it, it looks like it's cleaning up data. Specifically it's doing this non-stop, and the id's are in the millions.
DELETE FROM xi_meta WHERE meta_id = xxxxxxxx
Would it be safe to drop the xi_meta table and run repair_databases.sh again? Because I'm not seeing an end in sight for this.
Thanks,
Eric
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: NagiosXI Zombie process troubles
The offending table would be xi_events. However, you shouldn't drop the table, but you could truncate it without causing harm.
Additionally, it is worth mentioning that this would not clear up messages already sent to the mail spool.
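A truncate of just that table can be issued through psql. A sketch, assuming the XI defaults of a database and user both named nagiosxi (as used later in this thread), with the services that write to it stopped first:

```shell
service nagios stop
service ndo2db stop
echo "TRUNCATE TABLE xi_events;" | psql nagiosxi nagiosxi
service ndo2db start
service nagios start
```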
Re: NagiosXI Zombie process troubles
The zombie processes could also be caused by the external_command_buffer_slots option being set very high in the nagios.cfg file.
Try changing it from

Code: Select all

external_command_buffer_slots=2048

to

Code: Select all

external_command_buffer_slots=512

and see if that cuts down on the defunct processes.
You will have to restart the nagios process for the change to take effect.
If you still see defunct processes, remove that option altogether.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: NagiosXI Zombie process troubles
It looks like it finally cleared up on its own after leaving Nagios off for a bit. I ran the dd command, and it came back with 250 MB/s.
I think the first post I made is the best place to really look at what is going wrong.
write(3, "job_id=213\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"<ommited>\" --contactemail=\"<ommited>\" --type=RECOVERY --escalated=\"0\" --author=\"\" --comments=\"\" --host=\"<ommited>\" --hostaddress=\"<ommited>\" --hostalias=\"<ommited\" --hostdisplayname=\"<ommited>\" --service=\"DiskIO\" --hoststate=UP --hoststateid=0 --servicestate=OK --servicestateid=0 --lastservicestate=CRITICAL -"..., 970) = -1 EAGAIN (Resource temporarily unavailable)
There are over one hundred thousand of these a second, and it is what is driving up the load on the server.
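The EAGAIN in that strace line means a write to a non-blocking file descriptor (here, the pipe feeding the notification worker) would block because the kernel buffer is full and nothing is draining it. A minimal, self-contained illustration of the same errno using an ordinary pipe, not Nagios code:

```python
import errno
import fcntl
import os

# Create a pipe and put its write end in non-blocking mode,
# mimicking a writer whose reader has stalled.
r, w = os.pipe()
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

# Fill the pipe's kernel buffer without anyone reading from it.
try:
    while True:
        os.write(w, b"x" * 65536)
except BlockingIOError as e:
    # Same failure the strace line shows: buffer full, would block.
    print(e.errno == errno.EAGAIN)  # True
```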
Eric
Re: NagiosXI Zombie process troubles
I set external_command_buffer_slots to 512, no difference.
-
dwhitfield
- Former Nagios Staff
- Posts: 4583
- Joined: Wed Sep 21, 2016 10:29 am
- Location: NoLo, Minneapolis, MN
- Contact:
Re: NagiosXI Zombie process troubles
Did you truncate the table as suggested?
Can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.
After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.
Re: NagiosXI Zombie process troubles
I PM'd you the profile. I did not end up truncating the table because it eventually stopped deleting entries.
Re: NagiosXI Zombie process troubles
Thanks for the profile.
One thing: the /usr/local partition is almost full, and it should be cleaned up or increased in size soon.
Here are a couple of commands you can run to find out where the space has gone on that partition.

Find the largest 10 files by size:

Code: Select all

find /usr/local -type f -print0 | xargs -0 du | sort -n | tail -10 | cut -f2 | xargs -I{} du -sh {}

Find the highest inode count:

Code: Select all

for i in /usr/local/*; do echo $i; find $i | wc -l; done

Next, remove this option from the nagios.cfg file:

Code: Select all

external_command_buffer_slots=2048

Then run this to truncate the postgres tables to be sure they are clean:

Code: Select all

service nagios stop
service ndo2db stop
service crond stop
service postgresql restart
pkill -9 -u nagios
echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | psql nagiosxi nagiosxi
service crond start
service ndo2db start
service nagios start
service npcd restart

Let us know how it works out.
Re: NagiosXI Zombie process troubles
The /usr/local filling up is because of /usr/local/nagiosxi/var/sysstat.log, with a bunch of lines like the ones below:
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/cLS6KBo': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/cArUDUo': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/c6GCwpk.ok': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/c6GCwpk': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/clav5GR.ok': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/clav5GR': Operation not permitted
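A log that has grown like this can be truncated in place rather than deleted (deleting would leave the writing process holding the old inode, so the space wouldn't be freed). A minimal sketch, shown against a scratch file rather than the live sysstat.log path:

```shell
# Stand-in for /usr/local/nagiosxi/var/sysstat.log
LOG=$(mktemp)
head -c 1048576 /dev/zero > "$LOG"   # simulate 1 MiB of accumulated log lines
: > "$LOG"                           # truncate in place; open file handles survive
wc -c < "$LOG"                       # prints 0
rm -f "$LOG"
```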
I tried what you recommended and it made no difference.
Eric