[Nagios-devel] Distributed monitoring setup



Greetings,

I would like to discuss with others how they run Nagios in a
distributed environment.

We have 3 servers running Nagios: 1 external and 2 internal. The
external server gives outside parties limited views of their
hosts/connections into our network. The 2 internal servers are currently
set up in a distributed environment, with one server sending results to
the other (via nsca) due to its geographic location on our network. The
'central server' not only collects the results from the other
distributed server, but also actively checks approximately half of the
total number of hosts and services.
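
For readers unfamiliar with the passive side of this arrangement: results
delivered via nsca reach the central server as lines in its external
command file. The following is a minimal Python sketch of that line
format only. The function name and the host and service names are
hypothetical, and it is not how nsca itself is implemented.

```python
import time


def passive_service_result(host, service, return_code, output, timestamp=None):
    """Format one passive service check result as an external command line.

    The helper name and its arguments are illustrative assumptions; only the
    PROCESS_SERVICE_CHECK_RESULT line layout comes from Nagios.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


# Example with made-up names: a result for service "PING" on "remote-host".
line = passive_service_result("remote-host", "PING", 0, "PING OK")
```

A central server that also runs active checks would process such lines
alongside the results of its own scheduled checks.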

My reading of the Nagios documentation is that it assumes the
central server only accepts results from the distributed servers rather
than actively checking hosts and services itself. However, I see no
reason why the central server cannot also actively check; there
is no design issue that I am aware of.

How do others run their distributed setup?


Also, we believe that there are scheduling issues with Nagios under
different Linux kernels. With a 2.2.20 kernel in a distributed setup, we
found that the number of Nagios processes continued to grow, i.e. the
child processes were not being reaped. An strace of a child process
showed that it was waiting to write to the external command file, while
an strace of the parent process showed no errors, and with it attached
the reaping worked as expected.
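
To make the term concrete, "reaping" here means the parent collecting the
exit status of finished children so that they do not accumulate. The
following is a minimal Python sketch of the general POSIX mechanism (a
non-blocking waitpid loop). It is not Nagios's actual C code, and all
function names are hypothetical.

```python
import os
import time


def spawn_children(count):
    """Fork `count` children that exit immediately; return their pids."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit straight away without cleanup handlers
        pids.append(pid)
    return pids


def reap_finished_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def reap_all(expected, timeout=5.0):
    """Poll until every pid in `expected` has been reaped, or time runs out."""
    remaining = set(expected)
    deadline = time.monotonic() + timeout
    while remaining and time.monotonic() < deadline:
        remaining -= set(reap_finished_children())
        time.sleep(0.01)
    return not remaining
```

If a parent never gets around to a loop like this, exited children stay
in the process table and the process count keeps growing.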

As a workaround, we modified the Nagios start script to attach an strace
to the parent process, and Nagios ran fine that way for many months.
This was with Nagios 1.0a7 through 1.0b3 and the previously undocumented
'command_check_interval=-1'.

Recently we upgraded the monitoring hosts to a 2.4 kernel and
discovered an entirely different problem. The number of Nagios processes
grows exponentially until the load on the box is so high that a hard
reset is required. Again, the child processes do not appear to be
reaped as would be expected. An strace of the child processes
shows that they are waiting on a write to the internal pipe (to the
Nagios parent process) after reading the results from the external
command file.
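
As general background, one way a write to a pipe can stall is that the
pipe's kernel buffer is full because the reader is not draining it.
Whether that is what happens here is not established. The following
Python sketch only demonstrates the finite capacity of a pipe, using a
non-blocking writer so that the demonstration itself cannot hang. The
function name is hypothetical.

```python
import os


def fill_pipe_until_full(chunk=b"x" * 4096):
    """Write to a pipe that nobody reads and return the bytes it accepted.

    The write end is made non-blocking so that a full pipe raises
    BlockingIOError; a blocking writer would simply wait at that point.
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, chunk)
            except BlockingIOError:
                return written  # buffer full: no reader has drained it
    finally:
        os.close(read_fd)
        os.close(write_fd)


capacity = fill_pipe_until_full()
```

The returned value depends on the kernel and its pipe settings, so it is
deliberately not stated here.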

We have tried numerous ways to correct this problem, including
upgrading to Nagios 1.0b4 and using the latest base/checks.c
from CVS, but cannot get Nagios to reap the child processes
adequately. So, until we can resolve this problem, we have been forced
to downgrade to the 2.2 kernel, where Nagios 1.0b4 and base/checks.c
work fine (though with the strace on the parent process).


So, I would be interested to hear from others who are running
Nagios in a distributed setup under Linux whether they are
experiencing similar issues.

Regards,

Andrew




This post was automatically imported from historical nagios-devel mailing list archives
Original poster: andrew_kemp@pacific.net.au