NagiosXI Zombie process troubles

bolson

Re: NagiosXI Zombie process troubles

Post by bolson »

In addition, you could run load, memory, and I/O checks on the NagiosXI server itself to see whether the PHP failures coincide with resource problems on the server.
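
One way to do that is to log a timestamped resource snapshot at regular intervals and compare it against the times of the PHP failures. The sketch below is only an illustration: the function name `resource_snapshot` and the `PROC_DIR` override are made up for this example, it assumes a Linux `/proc` filesystem, and it covers load and free memory only (for I/O you would need a tool such as `iostat` from the sysstat package).

```shell
# Hypothetical helper: print one line with a timestamp, the 1-minute load
# average, and free memory in kB, read from the Linux /proc filesystem.
# PROC_DIR is an illustrative override (not a Nagios XI setting).
resource_snapshot() {
    proc_dir="${PROC_DIR:-/proc}"
    load1=$(awk '{ print $1 }' "$proc_dir/loadavg")
    mem_free_kb=$(awk '$1 == "MemFree:" { print $2 }' "$proc_dir/meminfo")
    printf '%s load1=%s mem_free_kb=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$load1" "$mem_free_kb"
}

# Example: append a snapshot every 60 seconds (the log path is an assumption).
# while true; do resource_snapshot >> /tmp/resource_snapshot.log; sleep 60; done
```

A spike in `load1` or a drop in `mem_free_kb` around the time of a failure would point towards a resource problem rather than a Nagios configuration problem.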
ejmorrow
Posts: 13
Joined: Fri May 13, 2016 9:02 am

Re: NagiosXI Zombie process troubles

Post by ejmorrow »

We reboot our servers on Sunday. This Sunday when we rebooted, it looks like there was a large memory leak or something else that exhausted the memory. The server ended up becoming unresponsive, so we rebooted it. Afterwards there were some complaints about MySQL tables, so we ran /usr/local/nagiosxi/scripts/repair_databases.sh, which seemed to clear up those issues, but we then ran into the issues we're currently having.

I noticed that even with Nagios shut down, the postmaster service remains busy. Looking into it, it appears to be cleaning up data. Specifically, it is doing this non-stop, and the IDs are in the millions:

DELETE FROM xi_meta WHERE meta_id = xxxxxxxx

Would it be safe to drop the xi_meta table and run repair_databases.sh again? I'm not seeing an end in sight for this.

Thanks,

Eric
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: NagiosXI Zombie process troubles

Post by scottwilkerson »

The offending table would be xi_events. However, you shouldn't drop the table; you could truncate it instead without causing harm.

Additionally, it is worth mentioning that this would not clear up messages already sent to the mail spool.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: NagiosXI Zombie process troubles

Post by tgriep »

The zombie processes could also be caused by the external_command_buffer_slots option being set very high in the nagios.cfg file.
Try changing it from

external_command_buffer_slots=2048
to

external_command_buffer_slots=512
and see if that cuts down on the defunct processes.
You will have to restart the nagios process for the change to take effect.
If you still see defunct processes, remove that option altogether.
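
To compare the number of defunct processes before and after the change, a small helper along these lines could be used. This is a sketch, not an official Nagios tool: `count_zombies` is a hypothetical name, and it assumes `ps` output whose first column is the process state, with a header line.

```shell
# Hypothetical helper: count defunct (zombie) processes.
# Reads `ps` output on stdin, skips the header line, and counts rows whose
# state column starts with "Z".
count_zombies() {
    awk 'NR > 1 && $1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Example usage on a live system:
# ps -eo stat,pid,ppid,comm | count_zombies
```

Running it once before restarting nagios and again a few minutes after shows whether the setting made any difference.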
ejmorrow
Posts: 13
Joined: Fri May 13, 2016 9:02 am

Re: NagiosXI Zombie process troubles

Post by ejmorrow »

It looks like it finally cleared up on its own after leaving Nagios off for a bit. I ran the dd command, and it came back with 250MBs.

I think the first post I made is the best place to really look at what is going wrong.

write(3, "job_id=213\0type=1\0command=/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_notification.php --notification-type=service --contact=\"<ommited>\" --contactemail=\"<ommited>\" --type=RECOVERY --escalated=\"0\" --author=\"\" --comments=\"\" --host=\"<ommited>\" --hostaddress=\"<ommited>\" --hostalias=\"<ommited\" --hostdisplayname=\"<ommited>\" --service=\"DiskIO\" --hoststate=UP --hoststateid=0 --servicestate=OK --servicestateid=0 --lastservicestate=CRITICAL -"..., 970) = -1 EAGAIN (Resource temporarily unavailable)

There are over one hundred thousand of these a second, and it is what is driving up the load on the server.
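
For anyone who wants to put a number on this, the failed writes in a captured strace log can be counted roughly as follows. This is only a sketch: `count_eagain_writes` and `NAGIOS_PID` are hypothetical names, and the pattern assumes strace lines shaped like the one above.

```shell
# Hypothetical helper: count write() calls that failed with EAGAIN.
# Reads strace output on stdin and prints the number of matching lines.
count_eagain_writes() {
    grep -c 'write(.*) = -1 EAGAIN'
}

# Example usage (NAGIOS_PID is a placeholder for the process being traced;
# strace writes to stderr, hence the 2>&1):
# timeout 10 strace -f -e trace=write -p "$NAGIOS_PID" 2>&1 | count_eagain_writes
```

Dividing the result by the capture time in seconds gives an approximate rate.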

Eric
ejmorrow
Posts: 13
Joined: Fri May 13, 2016 9:02 am

Re: NagiosXI Zombie process troubles

Post by ejmorrow »

I set external_command_buffer_slots to 512, no difference.
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: NagiosXI Zombie process troubles

Post by dwhitfield »

Did you truncate the table as suggested?

Can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the Download Profile button towards the top. If for whatever reason you cannot download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.
ejmorrow
Posts: 13
Joined: Fri May 13, 2016 9:02 am

Re: NagiosXI Zombie process troubles

Post by ejmorrow »

I PM'd you the profile. I did not end up truncating the table because it eventually stopped deleting entries.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: NagiosXI Zombie process troubles

Post by tgriep »

Thanks for the profile.
One thing: the /usr/local partition is almost full, and it should be cleaned up or increased in size soon.
Here are a couple of commands you can run to find out where the space has gone on that partition.

Find the 10 largest files by size:

find /usr/local -type f -print0 | xargs -0 du | sort -n | tail -10 | cut -f2 | xargs -I{} du -sh {}
Find the directory with the highest inode count:

for i in /usr/local/*; do echo "$i"; find "$i" | wc -l; done
Next, remove this option from the nagios.cfg file:

external_command_buffer_slots=2048
Then run the following to truncate the PostgreSQL tables and make sure they are clean:

service nagios stop
service ndo2db stop
service crond stop
service postgresql restart
pkill -9 -u nagios
echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | psql nagiosxi nagiosxi
service crond start
service ndo2db start
service nagios start
service npcd restart
Let us know how it works out.
ejmorrow
Posts: 13
Joined: Fri May 13, 2016 9:02 am

Re: NagiosXI Zombie process troubles

Post by ejmorrow »

/usr/local is filling up because of /usr/local/nagiosxi/var/sysstat.log, which contains a bunch of lines like these:

chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/cLS6KBo': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/cArUDUo': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/c6GCwpk.ok': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/c6GCwpk': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/clav5GR.ok': Operation not permitted
chown: changing ownership of `/var/nagiosramdisk/spool/checkresults/clav5GR': Operation not permitted
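
A rough way to confirm that these chown errors account for most of the file is to collapse the per-file paths and count the remaining distinct messages. The helper below is a sketch under that assumption; `summarize_log` is a hypothetical name, and the optional argument is the number of rows to print (default 10).

```shell
# Hypothetical helper: group log lines by message.
# Replaces the `...' quoted path in each line with a placeholder so that
# otherwise identical chown errors collapse into one row, then prints the
# most frequent messages with their counts.
summarize_log() {
    sed "s/\`[^']*'/<path>/" | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Example usage:
# summarize_log 5 < /usr/local/nagiosxi/var/sysstat.log
```

On a very large log, piping only the last part of the file through `tail -n` first keeps the sort manageable.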

I tried what you recommended and it made no difference.

Eric