Re: [Nagios-devel] Problems with many hanging Nagios

Guest · Post by **Guest** » Sat Jan 20, 2007 10:10 am

This is an old issue, but I want to thank everyone who identified it and
got a fix into Nagios. We rarely have significant outages, but when we did
I would see a backlog of Nagios processes (hundreds) and no passive check
results being processed. We had a network outage today, and with the new
patches I was able to see that we were hitting the Total Check Result
Buffers limit and adjust accordingly.

My problem is that while I no longer have the daemon accumulation, the
result buffer isn't being processed. I have service_reaper_frequency
set to 2 and command_check_interval=-1, but I don't see status updates
for the passive checks that are coming in. nagiostats output is below. All
my passive checks are on a 5-minute interval and I see them coming
in, but the output shows that Nagios hasn't processed any of the results
in at least 5 minutes. Any suggestions would be appreciated.

Nagios Stats 2.7
Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
Last Modified: 01-19-2007
License: GPL

CURRENT STATUS DATA
----------------------------------------------------
Status File: /usr/local/nagios/var/status.dat
Status File Age: 0d 0h 0m 33s
Status File Version: 2.7

Program Running Time: 0d 0h 14m 58s
Nagios PID: 4208
Used/High/Total Command Buffers: 53 / 98 / 16384
Used/High/Total Check Result Buffers: 7733 / 7749 / 16384

Total Services: 3935
Services Checked: 3935
Services Scheduled: 25
Active Service Checks: 25
Passive Service Checks: 3910
Total Service State Change: 0.000 / 17.570 / 1.069 %
Active Service Latency: 0.004 / 0.837 / 0.232 sec
Active Service Execution Time: 0.111 / 9.708 / 3.697 sec
Active Service State Change: 0.000 / 11.710 / 0.468 %
Active Services Last 1/5/15/60 min: 0 / 0 / 0 / 25
Passive Service State Change: 0.000 / 17.570 / 1.072 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 3337 / 3910
Services Ok/Warn/Unk/Crit: 3456 / 2 / 7 / 470
Services Flapping: 0
Services In Downtime: 0

Total Hosts: 2613
Hosts Checked: 2613
Hosts Scheduled: 0
Active Host Checks: 2613
Passive Host Checks: 0
Total Host State Change: 0.000 / 0.000 / 0.000 %
Active Host Latency: 0.000 / 0.000 / 0.000 sec
Active Host Execution Time: 0.000 / 0.131 / 0.000 sec
Active Host State Change: 0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 2613 / 0 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
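
To make the buffer figures above easier to read at a glance, here is a minimal Python sketch that parses a "Used/High/Total ... Buffers" line from nagiostats output and reports how full the buffer is. The helper names (parse_buffer_line, utilization) are my own and are not part of Nagios; the line format is assumed to match the 2.7 output shown above.

```python
import re

# Hypothetical helper, not part of Nagios: parses lines such as
# "Used/High/Total Check Result Buffers: 7733 / 7749 / 16384".
BUFFER_LINE = re.compile(
    r"\s*Used/High/Total (.+?):\s*(\d+)\s*/\s*(\d+)\s*/\s*(\d+)"
)

def parse_buffer_line(line):
    """Return (label, used, high, total), or None if the line does not match."""
    m = BUFFER_LINE.match(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4))

def utilization(count, total):
    """Percentage of buffer slots represented by count."""
    return 100.0 * count / total

line = "Used/High/Total Check Result Buffers: 7733 / 7749 / 16384"
label, used, high, total = parse_buffer_line(line)
print(f"{label}: {utilization(used, total):.1f}% used, "
      f"high-water mark {utilization(high, total):.1f}%")
```

With the numbers above, the check result buffer sits at a little under half of its 16384 slots, and current usage is almost at the high-water mark, which fits with results queuing up rather than being reaped.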

> -----Original Message-----
> From: [email protected] [mailto:nagios-devel-
> [email protected]] On Behalf Of Mahesh Kunjal
> Sent: Monday, December 18, 2006 6:43 PM
> To: [email protected]; [email protected]; linux-system-
> [email protected]
> Subject: Re: [Nagios-devel] Problems with many hanging Nagios
> processes(Nagios spawning rogue nagios processes eventually crashing Nagios
> server)
>
> We had a similar issue. We have a distributed environment with one master
> and 4 slaves. The total number of hosts monitored is 1900+, with
> 20000+ services spread across the 4 slaves.
>
> At times we saw 14K or more results being sent in a second from the slaves.
> This resulted in 100+ nagios processes being created.
>
> We changed the reaper frequency to 2 seconds and played with all the
> tunables. Nothing seemed to help.
>
> Looking at the nagios source, this is what I found was happening...
>
> Nagios has a commands file worker thread. When it gets woken up, it checks
> whether there is data in the pipe (nagios.cmd) and, if there is, forks a
> child process.
> This will be i

...[email truncated]...

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]