
Re: [Nagios-devel] Problems with many hanging Nagios processes

Posted: Mon Dec 18, 2006 4:42 pm
by Guest

We had a similar issue. We have a distributed environment with one
master and 4 slaves. In total, 1900+ hosts and 20000+ services are
monitored, spread across the 4 slaves.

At times we saw 14K or more results being sent per second from the
slaves. This resulted in 100+ nagios processes being created.

We changed the reaper frequency to 2 seconds and played with all the
tunables. Nothing seemed to help.

Looking at the Nagios source, this is what I found was happening...

Nagios has a command file worker thread. When it gets woken up, it
checks whether there is data in the pipe (nagios.cmd) and, if there is,
forks a child process. This happens in a loop that keeps checking the
pipe for data.

Now what does the forked nagios child process do?
It reads all the data from the pipe, one message at a time, and puts
it in the command buffer. If it is able to write to the buffer, it just
exits.

The problem here was that the command buffer had a limited size of 1024
slots. This is the default setting in include/nagios.h.in, in the line
#define COMMAND_BUFFER_SLOTS 1024.

This was not enough, and the child process started to wait for space in
the buffer to be freed so that the data retrieved from the pipe could be
put in it.

While this child process waited, the command worker thread got woken up
again, saw that there was still data in the pipe, and forked another
child. This repeated and eventually the server ran out of memory.
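
To make the bottleneck concrete, here is a simplified model of a fixed-slot buffer. It is only a sketch and not the actual Nagios code: slot_buffer, enqueue_burst and drain are made-up names, and it assumes the burst arrives faster than the main process drains the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a fixed-slot command buffer (not Nagios's struct). */
typedef struct {
    size_t slots;  /* capacity, e.g. the value of COMMAND_BUFFER_SLOTS */
    size_t items;  /* entries currently queued */
} slot_buffer;

/* Try to queue `count` results. Returns how many did NOT fit; in the
 * scenario described above, those are what a child would block on. */
size_t enqueue_burst(slot_buffer *buf, size_t count)
{
    assert(buf != NULL && buf->items <= buf->slots);
    size_t free_slots = buf->slots - buf->items;
    size_t accepted = count < free_slots ? count : free_slots;
    buf->items += accepted;
    return count - accepted;
}

/* The consumer removes up to `count` entries. Returns how many it removed. */
size_t drain(slot_buffer *buf, size_t count)
{
    assert(buf != NULL);
    size_t removed = count < buf->items ? count : buf->items;
    buf->items -= removed;
    return removed;
}
```

With 1024 slots, a burst of 14000 results leaves 12976 entries with nowhere to go until the buffer is drained. With 60000 slots the same burst fits completely.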

Here is what we did to resolve it.

1. Edit include/nagios.h.in
Change
#define COMMAND_BUFFER_SLOTS 1024
to
#define COMMAND_BUFFER_SLOTS 60000

And change
#define SERVICE_BUFFER_SLOTS 1024
to
#define SERVICE_BUFFER_SLOTS 60000

2. Run ./configure
(make sure you don't have nanosecond sleep enabled. Also disable the
Perl interpreter)

3. make all;make install
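
As a rough check on what the larger slot counts in step 1 cost in memory, here is a small sketch. It assumes that each buffer slot holds a single pointer to a queued entry. That is an assumption about the buffer layout rather than something verified here, and slot_table_bytes is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: each buffer slot stores one pointer to a queued entry.
 * Returns the size in bytes of a pointer table with `slots` entries. */
size_t slot_table_bytes(size_t slots)
{
    /* Guard against overflow in the multiplication below. */
    assert(slots <= SIZE_MAX / sizeof(void *));
    return slots * sizeof(void *);
}
```

Under that assumption and with 8-byte pointers, slot_table_bytes(60000) is 480000 bytes per buffer, so the tables themselves stay small. The queued entries take additional memory on top of that.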





- Mahesh Kunjal (maheshk)

-----------------------
This thread is located in the archive at this URL:
http://www.nagiosexchange.org/nagios-de ... ttofaq_pi1[showUid]=13177





This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]