RE: [Nagios-users] distributed monitoring/central server



Most of what I said below presumed Nagios had "smart" OCSP functionality
built in. The more I think about it, "smart" OCSP handling can only be
accomplished with a separate daemon or process, exactly as I have done
with the Perl script that reads the OCSP log at given intervals and
forks the entries into multiple send_nsca processes. Otherwise, the
general check process gets drawn out, exactly as I experienced. This
might be a really good feature for Nagios 2.0, which is why I cc'd
nagios-devel.

As a follow-up to my last email, I tried submitting each service result
individually (as the default OCSP command shell script on nagios.org
does) through send_nsca from the consolidated send_nsca daemon, and
everything worked quite well. I'm very satisfied with the results. The
most efficient settings for my implementation seem to be parsing
ocsp.log every 3 seconds and forking a send_nsca process for every 4000
bytes. If you have any questions about what I did, please let me know.
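
The batching step described above can be sketched roughly as follows.
This is not the Perl script itself, only a minimal illustration under
assumptions: results are one line each, a batch never splits a line, and
the host name and config path passed to send_nsca are hypothetical
placeholders.

```python
import subprocess


def batch_results(lines, max_bytes=4000):
    """Group whole result lines into batches of at most max_bytes each.

    A single line longer than max_bytes is kept whole in its own batch.
    """
    batches, current, size = [], [], 0
    for line in lines:
        text = line if line.endswith("\n") else line + "\n"
        n = len(text.encode("utf-8"))
        # Start a new batch when adding this line would exceed the limit.
        if current and size + n > max_bytes:
            batches.append("".join(current))
            current, size = [], 0
        current.append(text)
        size += n
    if current:
        batches.append("".join(current))
    return batches


def send_batch(batch, host="central.example.org",
               config="/etc/nagios/send_nsca.cfg"):
    """Pipe one batch to send_nsca on stdin (host and config are assumed)."""
    return subprocess.run(
        ["send_nsca", "-H", host, "-c", config],
        input=batch, text=True, check=False,
    )
```

A polling loop would read the new lines from ocsp.log every few seconds,
call batch_results() on them, and start one send_batch() per batch.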

-Jason

-----Original Message-----
From: nagios-users-admin@lists.sourceforge.net
[mailto:nagios-users-admin@lists.sourceforge.net] On Behalf Of Jason
Lancaster
Sent: Thursday, May 08, 2003 14:33
To: Ethan Galstad; nagios-users@lists.sourceforge.net
Subject: Re: [Nagios-users] distributed monitoring/central server
performance problems

Ethan and list,
I agree the check command interval I was using may not have been the
most efficient, and I likely would eventually have seen a problem on the
central server parsing these external commands. After making my first
post, I realized my issue is with how Nagios manages an outgoing OCSP
command.

I came to this conclusion from the following:
In my "non-distributed" test environment, I have 2683 service checks.
I'm using OCSP with ocsp_timeout=3. This OCSP command does not go to any
other systems; it is just a simple echo,
"echo $1 $2 $3 $4 >> ocsp.log". Looking at the web page and the
status.log, things are updating within a 10-15 minute interval. This is
the behavior I expect and it works quite well.

Make this command take slightly longer to execute by adding a "sleep 3"
and I start having problems. Service and host update intervals go from
approximately 10 minutes to 15, to 20, to 30... to never getting
updated. The system stops executing active checks of any type, including
freshness checks. Nagios becomes useless at this point.
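
One way to see why a 3-second command is so damaging is simple
arithmetic. The sketch below assumes the OCSP command runs once per
service result and that Nagios waits for each run to finish before
moving on; that is an assumption consistent with the symptoms described
here, not something confirmed from the source code.

```python
def serial_ocsp_seconds(num_checks, seconds_per_command):
    """Total time spent in OCSP commands if they run one after another."""
    return num_checks * seconds_per_command


def per_command_budget(num_checks, cycle_seconds):
    """Longest each OCSP command may take to fit all checks in one cycle."""
    return cycle_seconds / num_checks


checks = 2683                      # service checks in the test environment
total = serial_ocsp_seconds(checks, 3)
print(total)                       # 8049 seconds
print(round(total / 60, 2))        # 134.15 minutes, far beyond a 10-minute cycle
print(round(per_command_budget(checks, 600), 3))  # 0.224 seconds per command
```

Under that assumption, each OCSP command has roughly a fifth of a second
to finish if all 2683 results are to be processed within 10 minutes.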

If I comment out my sleep line at this point, Nagios begins to sync back
to its normal 10 minute intervals.

I don't know where the problem lies; it very well could be the way I
have Nagios configured. Personally, I theorize this is due to how Nagios
manages its OCSP commands: perhaps if one OCSP command takes a long time
to execute, Nagios thinks that everything needs more execution time. I
don't know much about C and I don't know the source well, but I'm more
than willing to work with anyone who wants more information on this
issue.

I've pretty much given up on handling any advanced OCSP methods within
Nagios and made Nagios execute the OCSP command as quickly as possible,
using a simple bash echo script and a FIFO. I have to keep in mind the
important factors Ethan discussed in his last reply, so I'm sending
consolidated NSCA results at 10 second intervals. This could be lowered
to make the consolidated NSCA parser send each service result through
NSCA individually (just like the default OCSP behavior of Nagios in a
distributed environment). This may in fact work. I have yet to test it,
but I think I will soon.
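
The "do as little as possible in the OCSP command" idea can be sketched
like this. It is only an illustration, not the bash script mentioned
above: the function names and the spool path are hypothetical, and the
tab-separated layout (host, service, return code, plugin output) is
assumed because it matches send_nsca's default input format.

```python
import os


def format_result(host, service, return_code, output):
    """Build one tab-delimited result line; whitespace in output is collapsed."""
    clean = " ".join(str(output).split())
    return f"{host}\t{service}\t{int(return_code)}\t{clean}\n"


def spool_result(path, host, service, return_code, output):
    """Append one result line to a spool file and return immediately."""
    line = format_result(host, service, return_code, output)
    # O_APPEND makes each write land at the end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line.encode("utf-8"))
    finally:
        os.close(fd)
```

The OCSP command then only appends a line, and a separate process
forwards the spooled lines to the central server on its own schedule.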

Thanks,
Jason

----- Original Message -----
From: "Ethan Galstad"
To:
Sent: Tuesday, May 06, 2003 19:24
Subject: RE: [Nagios-users] distributed monitoring/central server
performance problems


> NSCA may be to blame (consolidated transmits would help), but it is
> more likely that you are experiencing a bottleneck with the external
> command file. This file is implemented as a named pipe, which (under
> Linux) has a size of 4K. If one

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: nagios@nagios.org