
[Nagios-devel] RFC: Scheduler behaviour

Posted: Tue Jun 15, 2004 5:34 am
by Guest
Hello,

Over the last few days I have been analyzing the Nagios scheduler's
activity and I've noticed a strange execution pattern for plugins/checks.

I have about 1000 checks to run in a short time, so I've set the
concurrent checks parameter to 90 checks at a time.

Looking at the process activity on the server, I've noticed that the 90
processes are forked normally and then complete regularly, one after
the other.
For each forked plugin, a Nagios process stays still... it seems to be
waiting for the other processes to finish.
Everything flows perfectly until some hosts are in a down state and
Nagios needs to run check-host-alive against them.

What happens is that all the "still" Nagios processes waiting for the
last one to exit remain "still" until the check-host-alive command
completes.

As a result, the next 90 checks are not fired immediately by Nagios but
only after the last check-host-alive command completes. This wastes a
lot of time and increases the total execution time for all the checks
(currently 1000 checks are performed in about 20 minutes).
To work around this I've replaced ping with fping, which is faster, in
check-host-alive, reduced the fping retries from 5 to 3, and reduced
the timeout values in nagios.cfg. Now Nagios is faster, but it still
has to wait for the slow check-host-alive commands.

Maybe I'm wrong, and in fact I have NOT looked at the Nagios scheduler
sources, but it seems that the scheduler isn't written to maximize CPU
utilization.
What I've noticed is that:
1) Nagios forks 90 processes
2) Waits for 90 subprocesses (plugins) to finish
3) Collects and stores all results
4) Prepares the next 90 processes and then GOTO 1
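
To make the cost of this loop concrete, here is a toy model of it. This
is only a sketch of the behaviour I observed, not code taken from
Nagios: the function name batch_makespan and the workload (most checks
taking 1 second, every 50th taking 10 seconds) are invented for
illustration.

```python
def batch_makespan(durations, batch_size):
    """Wall-clock time if every batch must finish before the next one starts."""
    total = 0.0
    for start in range(0, len(durations), batch_size):
        batch = durations[start:start + batch_size]
        # The whole batch takes as long as its slowest check.
        total += max(batch)
    return total


# Hypothetical workload: 1000 checks, every 50th one is slow (10 s), the rest take 1 s.
durations = [10.0 if i % 50 == 0 else 1.0 for i in range(1000)]
print(batch_makespan(durations, 90))  # prints 111.0
```

In this model every full batch of 90 contains at least one slow check,
so each batch costs 10 seconds even though most of its checks finish
after 1 second.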

Why does Nagios wait for all 90 processes before forking more checks?
If 80 checks are already done and 10 are slow (e.g. check-host-alive),
why do I have to wait before 80 more checks are forked? I mean, is the
scheduler intended to work this way, or is keeping 90 concurrent active
checks at all times, treating them independently, a feature yet to come?
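
What I have in mind could be modelled roughly as follows. This is a
minimal sketch under my own assumptions, not a description of how Nagios
works or should be implemented: the function name sliding_makespan and
the example durations are hypothetical, and a new check is started as
soon as any one of the slots becomes free.

```python
import heapq


def sliding_makespan(durations, slots):
    """Wall-clock time when a new check starts as soon as any slot frees up."""
    finish_times = []  # min-heap holding the finish time of each running check
    now = 0.0
    for d in durations:
        if len(finish_times) == slots:
            # All slots are busy: wait only for the earliest check to finish.
            now = heapq.heappop(finish_times)
        heapq.heappush(finish_times, now + d)
    return max(finish_times) if finish_times else 0.0


# Hypothetical example: one slow check (10 s) and three fast ones (1 s), two slots.
print(sliding_makespan([10.0, 1.0, 1.0, 1.0], 2))  # prints 10.0
```

With two slots, the three fast checks run one after the other in the
second slot while the slow one occupies the first, so everything is done
after 10 seconds. Waiting for each pair to finish before starting the
next pair would take 11 seconds for the same checks.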

The following screen captures will show the situation in real time to
better explain what I've described.

The scheduler forks the checks:

[creator@guardian creator]$ ps ax
PID TTY STAT TIME COMMAND
4598 ? S 8:28 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13648 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13651 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13657 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13660 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13663 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13665 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13670 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13676 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13680 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13684 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13687 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13690 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13692 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13700 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13704 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13708 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13712 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13715 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13718 ? S 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
13720 ? S 0:00 /usr/local/nagios/bin/

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]