Re: [Nagios-devel] Problems with many hanging Nagios

Post by Guest »

This is an old issue, but I want to thank everyone who identified it and
got a fix into Nagios. We rarely have significant outages, but when we did
I would see a backlog of Nagios processes (hundreds) and no passive check
results being processed. We had a network outage today, and with the new
patches I was able to see that we were hitting the Total Check Result
Buffers limit and adjust accordingly.

My problem now is that while the Nagios processes no longer accumulate,
the result buffer isn't being processed. I have service_reaper_frequency
set to 2 and command_check_interval=-1, but I don't see status updates
for the passive checks that are coming in. The nagiostats output is below.
All my passive checks are on a 5-minute interval and I can see them coming
in, but as the output shows, Nagios hasn't processed any of the results
in at least 5 minutes. Any suggestions would be appreciated.

Nagios Stats 2.7
Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
Last Modified: 01-19-2007
License: GPL

CURRENT STATUS DATA
----------------------------------------------------
Status File: /usr/local/nagios/var/status.dat
Status File Age: 0d 0h 0m 33s
Status File Version: 2.7

Program Running Time: 0d 0h 14m 58s
Nagios PID: 4208
Used/High/Total Command Buffers: 53 / 98 / 16384
Used/High/Total Check Result Buffers: 7733 / 7749 / 16384

Total Services: 3935
Services Checked: 3935
Services Scheduled: 25
Active Service Checks: 25
Passive Service Checks: 3910
Total Service State Change: 0.000 / 17.570 / 1.069 %
Active Service Latency: 0.004 / 0.837 / 0.232 sec
Active Service Execution Time: 0.111 / 9.708 / 3.697 sec
Active Service State Change: 0.000 / 11.710 / 0.468 %
Active Services Last 1/5/15/60 min: 0 / 0 / 0 / 25
Passive Service State Change: 0.000 / 17.570 / 1.072 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 3337 / 3910
Services Ok/Warn/Unk/Crit: 3456 / 2 / 7 / 470
Services Flapping: 0
Services In Downtime: 0

Total Hosts: 2613
Hosts Checked: 2613
Hosts Scheduled: 0
Active Host Checks: 2613
Passive Host Checks: 0
Total Host State Change: 0.000 / 0.000 / 0.000 %
Active Host Latency: 0.000 / 0.000 / 0.000 sec
Active Host Execution Time: 0.000 / 0.131 / 0.000 sec
Active Host State Change: 0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 2613 / 0 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
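
For anyone who wants to watch for the same symptom automatically, here is a minimal Python sketch that reads the two figures I am looking at above: how full the check result buffer is, and whether any passive service results were processed in the 5-minute window. The function names (parse_stats, buffer_usage, passive_stalled) are made up for illustration and are not part of Nagios. It assumes the plain "Label: a / b / c" layout shown in the output above, and it assumes the "Last 1/5/15/60 min" counters mean services checked within those windows.

```python
import re

# Sample lines copied from the nagiostats output in this post.
SAMPLE = """\
Used/High/Total Command Buffers: 53 / 98 / 16384
Used/High/Total Check Result Buffers: 7733 / 7749 / 16384
Passive Services Last 1/5/15/60 min: 0 / 0 / 3337 / 3910
"""


def parse_stats(text):
    """Map each 'Label: a / b / c' line to a list of ints.

    Lines whose values are not all plain integers (paths, dates,
    percentages, durations) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so labels containing '/' stay intact.
        label, sep, values = line.rpartition(":")
        if not sep:
            continue
        parts = [p.strip() for p in values.split("/")]
        if all(re.fullmatch(r"\d+", p) for p in parts):
            stats[label.strip()] = [int(p) for p in parts]
    return stats


def buffer_usage(stats, label="Used/High/Total Check Result Buffers"):
    """Return (used fraction, high-water fraction) of the buffer slots."""
    used, high, total = stats[label]
    return used / total, high / total


def passive_stalled(stats, window_index=1):
    """True if no passive service result was seen in the chosen window.

    window_index 0/1/2/3 selects the 1/5/15/60 minute counter
    (assumed meaning: services checked within that many minutes).
    """
    return stats["Passive Services Last 1/5/15/60 min"][window_index] == 0


if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    used_frac, high_frac = buffer_usage(stats)
    print(f"check result buffer: {used_frac:.1%} used, {high_frac:.1%} high-water")
    print("passive results stalled:", passive_stalled(stats))
```

With the numbers above, the buffer is a bit under half full (7733 of 16384 slots) while the 5-minute passive counter is 0, which matches what I'm seeing: results are queued but not being reaped. In practice you would feed the sketch the real nagiostats output instead of SAMPLE.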

> -----Original Message-----
> From: [email protected] [mailto:nagios-devel-
> [email protected]] On Behalf Of Mahesh Kunjal
> Sent: Monday, December 18, 2006 6:43 PM
> To: [email protected]; [email protected]; linux-system-
> [email protected]
> Subject: Re: [Nagios-devel] Problems with many hanging Nagios
> processes (Nagios spawning rogue nagios processes eventually
> crashing Nagios server)
>
> We had similar issue. We have a distributed environment with one master
> and 4 slaves. Total number of hosts monitored are 1900+ and
> 20000+ services spread across 4 slaves.
>
> At times we saw 14K or more results being sent in a second from slaves.
> This resulted in 100+ nagios processes being created.
>
> Changed reaper frequency to 2 seconds and played with all tunables.
> Nothing seemed to help.
>
> Looking at the nagios source,
> This is what I found out was happening...
>
> Nagios has a commands file worker thread and when it gets woken up, looks
> if there is data in pipe(nagios.cmd), if exists, forks a child process.
> This will be i

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]