Too High Service Check Latency on Nagios CORE

krpanna
Posts: 5
Joined: Tue Feb 12, 2013 10:55 am

Post by krpanna

Hello Team,

We are facing a serious performance issue and very high service check latency (7696 sec on average) on Nagios Core. The checks are not running on time, and forced checks are not working either. We recently enabled/disabled the parameters below, but the situation remains the same. It would be very helpful if someone could help with this issue.

use_large_installation_tweaks=1
check_result_reaper_frequency=5
enable_flap_detection=0
enable_environment_macros=0
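Nagios only reads nagios.cfg at startup or reload, so it is worth confirming what the file actually contains after editing it. A minimal Python sketch that reports the four directives above; the nagios.cfg path is an assumed source-install default, and keeping only the last value of a repeated key is a simplification of this sketch.

```python
from __future__ import annotations

from pathlib import Path

# The four directives tuned in the post above.
TUNED = (
    "use_large_installation_tweaks",
    "check_result_reaper_frequency",
    "enable_flap_detection",
    "enable_environment_macros",
)

# Assumed path for a source install; adjust for your system.
NAGIOS_CFG = Path("/usr/local/nagios/etc/nagios.cfg")


def parse_main_config(text: str) -> dict[str, str]:
    """Collect key=value directives, skipping blank lines and '#' comments.

    Simplification for this sketch: if a key repeats (e.g. cfg_file),
    only the last value is kept.
    """
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def report(settings: dict[str, str]) -> dict[str, str | None]:
    """Value of each tuned directive, or None if the file does not set it."""
    return {name: settings.get(name) for name in TUNED}


if __name__ == "__main__":
    if NAGIOS_CFG.is_file():
        for name, value in report(parse_main_config(NAGIOS_CFG.read_text())).items():
            print(f"{name} = {value if value is not None else '(not set)'}")
    else:
        print(f"{NAGIOS_CFG} not found")
```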

Details of the Nagios Core setup used in our infrastructure:
Nagios version : 3.5.0
pnp4nagios version : 0.6.21
snmp traps : yes
snmp polling : yes
ndo2db : yes
nsca : yes
H/W details: Linux 2.6.32-504.3.3.el6.x86_64 virtual machine, 4 cores, 30GB RAM, 4GB
Number of hosts: 2113
Number of service checks: 16649



Monitoring Performance (min / max / average)
Service Check Execution Time: 0.00 / 762.76 / 21.544 sec
Service Check Latency: 0.00 / 125337.24 / 7696.736 sec
Host Check Execution Time: 0.00 / 31.06 / 6.688 sec
Host Check Latency: 0.00 / 9120.72 / 7527.235 sec

Number of active host / service checks: 1749 / 13534
Number of passive host / service checks: 543 / 3185
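As a rough sanity check, Little's law (average checks in flight = check rate × average execution time) shows how much concurrency these figures imply. A minimal sketch, assuming every active check runs on a 5-minute interval, which the post does not state; substitute your real check_interval values.

```python
# Figures copied from the "Monitoring Performance" output above.
ACTIVE_SERVICE_CHECKS = 13534
ACTIVE_HOST_CHECKS = 1749
AVG_SERVICE_EXEC_SEC = 21.544
AVG_HOST_EXEC_SEC = 6.688

# Assumption: every active check is scheduled every 5 minutes.
CHECK_INTERVAL_SEC = 300


def required_concurrency(num_checks: int, avg_exec_sec: float,
                         interval_sec: float) -> float:
    """Little's law: average checks in flight = arrival rate * average duration."""
    checks_per_sec = num_checks / interval_sec
    return checks_per_sec * avg_exec_sec


if __name__ == "__main__":
    svc = required_concurrency(ACTIVE_SERVICE_CHECKS, AVG_SERVICE_EXEC_SEC, CHECK_INTERVAL_SEC)
    host = required_concurrency(ACTIVE_HOST_CHECKS, AVG_HOST_EXEC_SEC, CHECK_INTERVAL_SEC)
    print(f"service checks in flight: {svc:.0f}")
    print(f"host checks in flight:    {host:.0f}")
```

With these numbers it prints about 972 concurrent service checks and 39 concurrent host checks. If a 4-core VM cannot sustain that many plugin processes at once, checks start late and latency keeps growing, which would be consistent with the latency figures above.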


Thanks
Prasanna
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Too High Service Check Latency on Nagios CORE

Post by jdalrymple

You are running a pretty substantial environment without offloading any of the work. It's not surprising that you're running into performance problems with that environment. How to fix:

The easy route would be to offload some work using passive checks or Gearman workers; the latter is a fairly trivial way to improve performance.
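Passive results reach Nagios through its external command file, which is also how the NSCA daemon hands results over. A minimal sketch of formatting and submitting one service result, assuming the default source-install command file path (check the `command_file` directive in nagios.cfg); the host and service names in the usage are hypothetical.

```python
from __future__ import annotations

import time
from pathlib import Path

# Assumed source-install default; check the command_file directive in nagios.cfg.
COMMAND_FILE = Path("/usr/local/nagios/var/rw/nagios.cmd")

STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def passive_service_result(host: str, service: str, return_code: int,
                           output: str, timestamp: int | None = None) -> str:
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in STATES:
        raise ValueError(f"return_code must be one of {sorted(STATES)}")
    ts = int(time.time()) if timestamp is None else timestamp
    # External commands are line-based, so keep the output on one line.
    output = output.replace("\n", " ")
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}\n"


def submit(line: str, command_file: Path = COMMAND_FILE) -> None:
    """Write one command into the external command file (a named pipe)."""
    with command_file.open("a") as pipe:
        pipe.write(line)


if __name__ == "__main__":
    # Hypothetical host/service names, for illustration only.
    print(passive_service_result("web01", "HTTP", 0, "HTTP OK - 200"), end="")
```

On the Nagios host, `submit(passive_service_result("web01", "HTTP", 0, "HTTP OK"))` would queue the result; the writing user needs permission on the command pipe.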

Another big help would be to move your NDO database (ndo2db) to a separate host if it currently runs on the Nagios server. We have documentation for doing that with Nagios XI that could probably be used as a basic framework for doing it with your Core 3.5 environment:

Lastly - and this is *probably* the first thing you should do - upgrade Nagios Core to 4.x. Core 4 introduced some MAJOR performance enhancements.
krpanna

Re: Too High Service Check Latency on Nagios CORE

Post by krpanna

Hello jdalrymple,

Thanks for your reply. Yes, we have offloaded some service checks to passive checks: passive checking is enabled for 543 hosts and 3185 service checks. After enabling passive checks, the Nagios server only processes the passive checks and the active checks are not running. This is our main problem.

We are planning to upgrade Nagios as soon as possible. Also, I couldn't open the given link. We are in the process of getting Nagios Core support; I have received an account number and we are going to make the payment soon.

Please help me resolve this.

Thanks
Prasanna
jdalrymple

Re: Too High Service Check Latency on Nagios CORE

Post by jdalrymple

Are you saying the problem got worse when you offloaded a bunch of checks to be passive?

It doesn't really matter. Have you implemented Gearman workers or offloaded your database? Those are the two other easy wins besides upgrading.

If you've done both (Gearman and offloading your database), it's time to start optimizing individual checks. That requires looking at how each service check runs individually. Remove or restructure any services that rely on .pl or .py scripts, and so on.
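One way to decide which checks to optimize first is to rank them by execution time. Nagios records a per-service `check_execution_time` in its status file; the sketch below parses the `servicestatus` blocks and lists the slowest services. The status file path is an assumed source-install default (see `status_file` in nagios.cfg).

```python
from __future__ import annotations

from pathlib import Path

# Assumed source-install default; see the status_file directive in nagios.cfg.
STATUS_FILE = Path("/usr/local/nagios/var/status.dat")


def service_blocks(text: str):
    """Yield one dict of key=value fields per 'servicestatus { ... }' block."""
    current: dict[str, str] | None = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "servicestatus {":
            current = {}
        elif line == "}" and current is not None:
            yield current
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value


def slowest(text: str, n: int = 10) -> list[tuple[float, str, str]]:
    """Return (seconds, host, service) for the n slowest service checks."""
    rows = [
        (float(b.get("check_execution_time", 0)),
         b.get("host_name", "?"),
         b.get("service_description", "?"))
        for b in service_blocks(text)
    ]
    return sorted(rows, reverse=True)[:n]


if __name__ == "__main__":
    if STATUS_FILE.is_file():
        for secs, host, svc in slowest(STATUS_FILE.read_text()):
            print(f"{secs:8.2f}s  {host} / {svc}")
```

Given the 762 sec maximum execution time reported earlier, the top of this list is where restructuring or moving a check would likely pay off most.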
krpanna

Re: Too High Service Check Latency on Nagios CORE

Post by krpanna

Hello jdalrymple,

Thanks again for your reply. We are already offloading the data to a MySQL DB using NDOUtils. We started facing the issue after enabling the passive checks, so I am planning to implement Gearman now. The problem is that this is a production server, and a single Nagios Core instance monitors more than 2k hosts. Do you have a procedure or steps for implementing Gearman? I found some links, but they differ from each other.

https://labs.consol.de/nagios/mod-gearman/
http://labs.consol.de/lang/en/nagios/mo ... art-guide/
http://hasin.me/2013/10/30/installing-g ... om-source/
http://www.devops.zone/monitoring/distr ... ple-nodes/
https://wiki.icinga.org/display/howtos/ ... and+Debian

It would be very helpful if you could share the steps.

Thanks
Prasanna
jdalrymple

Re: Too High Service Check Latency on Nagios CORE

Post by jdalrymple

krpanna wrote:
Hello jdalrymple,

Thanks again for your reply. We are already offloading the data to a MySQL DB using NDOUtils.

I meant moving your database to a totally different host - are you doing that? Without that, you're actually adding load with ndoutils, not reducing it.

krpanna wrote:
Do you have a procedure or steps for implementing Gearman? I found some links, but they differ from each other.
<snip>
It would be very helpful if you could share the steps.
Gearman installation is simple once you understand what Gearman does. You might want to play with it in a test lab so you "get it" before deploying it on your production system. Use our Nagios XI documentation, but pay particular attention to the version numbers. Also note that you may have to install some dependencies such as EPEL, gcc, automake, etc.

Follow the section "Server Installation (Nagios XI 2012 / Core 3.x)" on your Nagios server, and then "Worker Installation (For use with a Master Server Running XI 2012 / Core 3.x)" on your worker server(s). The installation script will modify your nagios.cfg, so make sure you have a backup of it; if things go awry, you can revert that file and everything will be back to normal. You can consider the worker servers disposable: if one goes haywire, just start from scratch with a fresh system.

Don't forget that any checks you run on the Nagios server will have to run on the worker server for everything to function properly. This means any plugins, authentication files, or configuration files you use will need to be installed on the workers as well. If you have a lot of customizations for some host/service checks, you can split them out using hostgroups or servicegroups.
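The requirement that every plugin exists on the workers can be checked mechanically. A minimal sketch, assuming copies of the Nagios server's resource.cfg and commands.cfg sit at the default source-install paths on the worker: it resolves `$USERn$` macros, takes the first word of each `command_line`, and lists executables missing on the machine it runs on. It checks only the plugin binaries, not any authentication or configuration files they read.

```python
from __future__ import annotations

import os
import re

# Assumption: $USERn$ macros are defined in resource.cfg as "$USER1$=/path".
USER_MACRO = re.compile(r"^\s*(\$USER\d+\$)\s*=\s*(.+?)\s*$")


def user_macros(resource_cfg: str) -> dict[str, str]:
    """Map each $USERn$ macro to its value."""
    macros: dict[str, str] = {}
    for line in resource_cfg.splitlines():
        match = USER_MACRO.match(line)
        if match:
            macros[match.group(1)] = match.group(2)
    return macros


def plugin_paths(commands_cfg: str, macros: dict[str, str]) -> set[str]:
    """Collect the executable (first word) of every command_line directive."""
    paths: set[str] = set()
    for line in commands_cfg.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0] == "command_line":
            executable = parts[1].split()[0]
            for name, value in macros.items():
                executable = executable.replace(name, value)
            paths.add(executable)
    return paths


def missing_on_this_host(paths: set[str]) -> list[str]:
    """Return plugins that are not executable files on this machine."""
    return sorted(p for p in paths
                  if not (os.path.isfile(p) and os.access(p, os.X_OK)))


if __name__ == "__main__":
    # Assumed locations of the copied server files on this worker.
    resource = "/usr/local/nagios/etc/resource.cfg"
    commands = "/usr/local/nagios/etc/objects/commands.cfg"
    if os.path.isfile(resource) and os.path.isfile(commands):
        with open(resource) as r, open(commands) as c:
            missing = missing_on_this_host(plugin_paths(c.read(), user_macros(r.read())))
        print("\n".join(missing) or "all plugins present")
```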

Like I said - understanding how Gearman works is almost all of the battle. Installing it and making it work once you know what it does is easy as pie.

Good luck.
krpanna

Re: Too High Service Check Latency on Nagios CORE

Post by krpanna

Hi jdalrymple,

I have successfully integrated mod_gearman with my Nagios instance. At first it queues checks for all hosts and services, but after some time mod_gearman stops putting all the (active and passive) checks into the queue and only the passive checks are performed.

Do I need to change any settings? How often does mod_gearman create the host and service queues?

I am also getting HOST DOWN alerts while host checks are sitting in the queue. The worker is processing the host check queue, and the service check queue waits for a worker until the host check queue is done. How can I divide the workers equally between host and service checks?

The following alert appears 14 times, all with the same timestamp:
Host Down[04-26-2015 22:13:23] HOST ALERT: DOWN;SOFT;1;(host check orphaned, is the mod-gearman worker on queue 'host' running?)
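To see which queues are affected and how often, a small sketch can count the orphaned-check messages per Mod-Gearman queue. The pattern matches the alert text quoted above; treating the service variant of the message as having the same wording is an assumption, and the log path is an assumed source-install default (see `log_file` in nagios.cfg).

```python
from __future__ import annotations

import re
from collections import Counter
from pathlib import Path

# Matches the alert text quoted above; "check orphaned" is assumed to also
# cover the service-check variant of the message.
ORPHANED = re.compile(
    r"check orphaned, is the mod-gearman worker on queue '([^']+)' running\?"
)

# Assumed source-install default; see the log_file directive in nagios.cfg.
NAGIOS_LOG = Path("/usr/local/nagios/var/nagios.log")


def orphaned_by_queue(lines) -> Counter:
    """Count orphaned-check messages per Mod-Gearman queue name."""
    counts: Counter = Counter()
    for line in lines:
        match = ORPHANED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    if NAGIOS_LOG.is_file():
        with NAGIOS_LOG.open(errors="replace") as log:
            for queue, count in orphaned_by_queue(log).most_common():
                print(f"{queue}: {count}")
```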

Thanks
Prasanna
jdalrymple

Re: Too High Service Check Latency on Nagios CORE

Post by jdalrymple

This sounds like the results you'd have if you didn't actually have a worker running. Let's see the output of gearman_top.