Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem
Posted: Wed Aug 24, 2011 1:37 pm
I noticed from the output that you have a large number of unknown and critical services. Are those taking a long time to time out? One thing you might try, which I know isn't ideal, is removing the checks that might be failing: start with just the host checks, and once those look good, add a few services back, then a few more, and so on, until you notice the times going through the roof again. That might help you figure out where your threshold is, and whether certain checks are causing the problem. Is this a physical or a virtual server?
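To put a number on "a high amount", here is a small Python sketch that computes the share of non-OK results from the state counts in the nagiostats output quoted further down. The variable and function names are only illustrative; the figures are copied from that output.

```python
# State counts copied from the quoted nagiostats output.
# "Services Ok/Warn/Unk/Crit" and "Hosts Up/Down/Unreach".
services = {"ok": 46943, "warning": 56, "unknown": 7660, "critical": 13547}
hosts = {"up": 23591, "down": 10512, "unreachable": 0}


def share(counts, failing_states):
    """Return the fraction of objects that are in one of the given states."""
    total = sum(counts.values())
    failing = sum(counts[state] for state in failing_states)
    return failing / total


service_share = share(services, ["unknown", "critical"])
host_share = share(hosts, ["down", "unreachable"])

print(f"services unknown or critical: {service_share:.1%}")
print(f"hosts down or unreachable:    {host_share:.1%}")
```

Roughly three in ten checks are failing on both sides. If many of those run until they time out, they would hold a worker far longer than a check that returns OK, which is why the question about timeouts matters.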
Dan
From: Rodney Ramos [mailto:[email protected]]
Sent: Wednesday, August 24, 2011 9:26 AM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem
Hi Sven. Thank you again. I'm pretty sure that my check interval is 15 min for both hosts and services; I've set this in the templates.cfg file (see below). I'm also sending the nagiostats output. I agree with you that 100k checks / 15 min is about 111 checks/sec, but the problem is that Nagios does not spread these checks evenly over time. That's the problem.
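As a rough check of that arithmetic, the sketch below recomputes the expected rate from the exact totals in the nagiostats output and compares it with the rate implied by the "Active Services/Hosts Last 1/5/15/60 min" counters. It assumes interval_length is left at the Nagios default of 60 seconds, so an interval of 15 means 15 minutes. The "Last N min" counters count distinct objects checked in the window, so this only approximates the execution rate, and only for windows no longer than the check interval. Function and variable names are illustrative.

```python
# Totals copied from the nagiostats output in this message.
TOTAL_SERVICES = 68206
TOTAL_HOSTS = 34103
CHECK_INTERVAL_MIN = 15  # assumes interval_length = 60 (the Nagios default)

# "Active Services/Hosts Last 1/5/15/60 min" counters, keyed by window in minutes.
SERVICES_LAST = {1: 0, 5: 2897, 15: 35932}
HOSTS_LAST = {1: 0, 5: 5936, 15: 29437}


def expected_rate(total_checks, interval_min):
    """Checks per second if every object is checked once per interval."""
    return total_checks / (interval_min * 60)


def observed_rate(window_min):
    """Approximate checks per second over the last window_min minutes."""
    checked = SERVICES_LAST[window_min] + HOSTS_LAST[window_min]
    return checked / (window_min * 60)


expected = expected_rate(TOTAL_SERVICES + TOTAL_HOSTS, CHECK_INTERVAL_MIN)
print(f"expected:         {expected:6.1f} checks/sec")
for window in (1, 5, 15):
    print(f"last {window:2d} min:      {observed_rate(window):6.1f} checks/sec")
```

With these figures the expected rate is about 113.7 checks/sec, while the counters imply about 72.6 checks/sec over the last 15 minutes, 29.4 over the last 5 minutes, and none in the last minute. That is consistent with the checks running in bursts rather than evenly.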
==========
templates.cfg
==========
define host{
name generic-host
...
check_interval 15
....
}
define service{
name generic-service
...
normal_check_interval 15
....
}
==============
nagiostats output
==============
Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 10-03-2010
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /usr/local/nagios/var/status.dat
Status File Age: 0d 0h 0m 17s
Status File Version: 3.2.3
Program Running Time: 0d 17h 43m 2s
Nagios PID: 18854
Used/High/Total Command Buffers: 0 / 0 / 4096
Total Services: 68206
Services Checked: 68206
Services Scheduled: 68206
Services Actively Checked: 68206
Services Passively Checked: 0
Total Service State Change: 0.000 / 43.880 / 2.774 %
Active Service Latency: 40.671 / 503.137 / 234.919 sec
Active Service Execution Time: 0.003 / 24.737 / 2.527 sec
Active Service State Change: 0.000 / 43.880 / 2.774 %
Active Services Last 1/5/15/60 min: 0 / 2897 / 35932 / 68206
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 46943 / 56 / 7660 / 13547
Services Flapping: 980
Services In Downtime: 0
Total Hosts: 34103
Hosts Checked: 34103
Hosts Scheduled: 34103
Hosts Actively Checked: 34103
Host Passively Checked: 0
Total Host State Change: 0.000 / 63.820 / 2.598 %
Active Host Latency: 0.000 / 474.337 / 247.944 sec
Active Host Execution Time: 0.000 / 20.354 / 2.033 sec
Active Host State Change: 0.000 / 63.820 / 2.598 %
Active Hosts Last 1/5/15/60 min: 0 / 5936 / 29437 / 34103
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 23591 / 10512 / 0
Hosts Flapping: 597
Hosts In Downtime: 0
Active Host Checks Last 1/5/15 min: 3 / 89 / 209
Scheduled:
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: Rodney Ramos [mailto:[email protected]]