[Nagios-devel] Nagios spawning rogue nagios processes eventually crashing Nagios server
Posted: Wed Sep 07, 2005 1:11 am
Hi All,
I'm having a bit of a problem with my Central Nagios server - it is
leaking memory because it is spawning nagios processes without
closing them.
Note: details such as OS flavour and version, architecture and
compile options are listed at the end of this email.
I have 6 Nagios servers: 5 are distributed and report to the 6th
Nagios server (the Central Nagios server).
All Nagios servers are installed from a custom built RPM. Details of
the build environment are listed below. Distributed Nagios servers
are sending service check results to the Central Nagios server via
NSCA.
The central Nagios server is running Nagios v2.0b4. I am using NSCA
v2.4.
The central Nagios server is receiving passive check results from the
5 distributed servers. It is receiving results from 82 hosts and
1300 services.
I believe the memory leak has something to do with the processing of
performance data. I am using nagiosgraph v0.4 (nagiosgraph.sf.net) to
process performance data, with the default processing method (i.e.
nagiosgraph is run every time a service check result is received by
the Central Nagios server).
As per the documentation, I believe that every time a service check
result is received, Nagios will spawn a new Nagios instance to run the
performance processing command.
The problem is that, over time, hundreds of rogue Nagios processes
accumulate on the Central Nagios server and never exit. Each rogue
Nagios process consumes memory, and eventually the machine runs out of
memory and swap, rendering it unusable (it has to be rebooted).
Strangely, each rogue nagios process is listed as having process 1 as
its parent rather than the master nagios process, so it appears to be
getting separated from its parent at some point. I believe this may be
caused by many processes competing to write to the same file (possibly
/var/log/nagios/rw/nagios.cmd) but, due to locking or race conditions,
being unable to do so and thus remaining running permanently.
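For anyone wanting to track how fast these processes accumulate, here is a rough Python sketch that counts entries in `ps -ef` style output whose parent PID is 1 and whose command starts with /usr/bin/nagios. The function name `count_orphans` and the sample text are made up for illustration (the third sample line, with PID 40000, is hypothetical), and the column layout is assumed to match the listing in this post. Note that a master daemon started with -d will normally also show parent 1, so it is included in the count.

```python
def count_orphans(ps_output: str, command_prefix: str = "/usr/bin/nagios") -> int:
    """Count processes with parent PID 1 whose command starts with command_prefix.

    Assumes `ps -ef` style columns: UID PID PPID C STIME TTY TIME CMD.
    """
    count = 0
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the command keeps its arguments.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skip blank or truncated lines
        ppid, cmd = fields[2], fields[7]
        if ppid == "1" and cmd.startswith(command_prefix):
            count += 1
    return count


# Two lines taken from the listing in this post, plus one hypothetical
# child process (PID 40000) that still has a live parent.
SAMPLE = """\
nagios 27775 1 0 17:05 ? 00:00:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
nagios 30478 1 0 17:09 ? 00:00:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
nagios 40000 27775 0 17:09 ? 00:00:00 /usr/bin/perl /usr/local/nagiosrrd/insert_fast.pl
"""

orphans = count_orphans(SAMPLE)
print(orphans)
```

Feeding it the real output of `ps -ef` at intervals would show whether the count grows steadily or in bursts.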
The performance processing command being run is:
/usr/bin/perl /usr/local/nagiosrrd/insert_fast.pl "$LASTSERVICECHECK$||$HOSTNAME$||$SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$"
The command is run by the standalone Perl interpreter; I have not
compiled in embedded Perl support.
(Perl version)
# perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi
(with 1 registered patch, see perl -V for more detail)
I have no problems whatsoever if I reduce the number of hosts sending
their results to the Central Nagios server. With 8 hosts and about
100 services, there is no memory leak, and CPU usage is about 10%
(5% IO wait, 3% system, 2% user).
I also see no memory leak if I disable processing of performance
data.
With 1300 services reporting, each checked every 5 minutes, this
equates to about 4 service check results per second.
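The arithmetic behind that figure, as a small Python sketch. The variable names are only illustrative, and the per-hour figure assumes (as described above) that each received result triggers one run of the performance processing command.

```python
services = 1300                  # services reporting to the central server
check_interval_seconds = 5 * 60  # each service is checked every 5 minutes

# Average rate of incoming service check results.
results_per_second = services / check_interval_seconds

# Each service reports 12 times per hour at a 5 minute interval.
results_per_hour = services * (3600 / check_interval_seconds)

print(f"{results_per_second:.2f} results per second")  # 4.33 results per second
print(f"{results_per_hour:.0f} results per hour")      # 15600 results per hour
```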
There are never more than 1 or 2 perl processes running, so Perl
appears to be running the performance processing script and exiting
normally.
With 1300 services reporting, CPU usage sits at 100% continually
(1-3% IO wait, 80-90% system, 10-15% user). The load average isn't
too bad:
# uptime
19:52:13 up 3:24, 1 user, load average: 2.40, 2.78, 2.57
Sample snippet of the process listing (full process listing below):
(Currently, after about 3.5 hours of uptime, there are 227 nagios
processes and 131 nsca processes running.)
...
nagios 27775 1 0 17:05 ? 00:00:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
nagios 30478 1 0 17:09 ? 00:00:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
nagios 31653 1 0 17:10 ? 00:00:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
nagios 465 1 0 17:12 ? 00:00:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
nagios 834 1 0 17:13 ? 00:00:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
nagios 1935 1 0 17:14 ? 00:00:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
nagios 5738 1 0 17:19 ? 00:00:00 /usr/bin/nagios -d /etc/nagios/nagios.cfg
nagios 6068 1 0 17:20 ? 00:00:00 /usr/bin/nagios -d /e
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]