Nagios is not checking services and hosts on-time

rrkraft · Post by **rrkraft** » Sun Sep 11, 2011 7:54 pm

I'm using Nagios Core 3.2.3 with roughly 850 servers and 7800 service checks. Most of the service checks are nrpe checks, but I have also the usual ping and host ping check and a ssh check. So every server has 9 services plus the host check. I configured the ping check to run every 5 minutes, with 3 retries at 1-minute intervals. The other checks run every 10 minutes, also with 3 retries at 1-minute intervals. So this actually makes a lot of checks. If my math is right, Nagios has to start 15 checks per second to keep up.
I'm running this on Red Hat Enterprise Linux 5, 64bit in a VM with 4 CPUs and 4 GB memory. The server per se is not max'd out. I always have 10 to 15 % CPU idle and memory is also fine. I'm not paging.

But over the weekend I realized that my checks are actually quite far behind. I then changed the 10-minute checks to 30-minute checks, which gets it down to roughly 10 checks per second.
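As a rough sketch of the arithmetic above: the steady-state rate is the number of checks divided by their interval. The split into one ping service per host plus "everything else", and the 5-minute host check interval, are assumptions for illustration and are not stated in the post.

```python
def checks_per_second(groups):
    """Sum the steady-state rate for (number_of_checks, interval_seconds) pairs."""
    return sum(count / interval for count, interval in groups)


HOSTS = 850
SERVICES = 7800
PING_INTERVAL = 5 * 60  # ping service check interval, from the post

# Assumption: exactly one ping service per host; all other services share
# the slower interval.
other_services = SERVICES - HOSTS

before = checks_per_second([(HOSTS, PING_INTERVAL), (other_services, 10 * 60)])
after = checks_per_second([(HOSTS, PING_INTERVAL), (other_services, 30 * 60)])

# Assumption: host checks also run every 5 minutes (not stated in the post).
host_rate = checks_per_second([(HOSTS, 5 * 60)])

print(f"10-minute services: {before:.1f}/s, with host checks {before + host_rate:.1f}/s")
print(f"30-minute services: {after:.1f}/s, with host checks {after + host_rate:.1f}/s")
```

Under these assumptions the service checks alone come to roughly 14.4 per second before the change and 6.7 after it. Adding the assumed host checks gives about 17 and 9.5 per second, which is in the same range as the estimates above. Retries are ignored here, so the real load is somewhat higher whenever checks fail.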

But I still see that it runs the ping checks (one of the most important ones) only every 15 to 20 minutes. With the 3 retries Nagios alerts only 45 minutes to an hour after the server goes down. That's way too late. I would expect a page after 8 to 9 minutes at most (normal checks every 5 minutes, then 3 retries 1 minute apart, so it should reach a hard ping failure after 8 minutes).
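The expected and observed delays can be sketched with a deliberately simplified model. It assumes the failure happens right after a normal check, and that under load every check (retries included) ends up spaced by the observed 15 to 20 minute gap. The function names are made up for this illustration.

```python
def expected_alert_delay(check_interval, retry_interval, retries):
    """Minutes until a hard failure when every check runs on schedule."""
    return check_interval + retries * retry_interval


def lagging_alert_delay(observed_gap, retries):
    """Minutes until a hard failure when each of the checks is effectively
    spaced by observed_gap minutes (simplified model of a backed-up queue)."""
    return observed_gap * retries


on_time = expected_alert_delay(check_interval=5, retry_interval=1, retries=3)
lag_low = lagging_alert_delay(observed_gap=15, retries=3)
lag_high = lagging_alert_delay(observed_gap=20, retries=3)

print(f"on schedule: {on_time} min, lagging: {lag_low} to {lag_high} min")
```

With the figures from the post this gives 8 minutes on schedule versus 45 to 60 minutes when the queue lags, which matches what is being observed.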

Since I probably have a lot of checks I'm asking for tuning tips or trouble-shooting tips. What do you think I should change to get Nagios to run these checks really on-time? Can I get this to work?
Or is it known that this is way too many checks for one Nagios instance and I will never get this to work?

I'm trying to replace Hobbit/Big Brother with Nagios. Hobbit actually can run this load just fine. I'm a huge fan of Nagios and it took me three years of talking to be allowed to try Nagios across all servers. It would be too embarrassing if Nagios cannot handle the load which Hobbit handles just fine without any hiccup.

Any help would be greatly appreciated.

crfriend · Post by **crfriend** » Mon Sep 12, 2011 4:22 pm

rrkraft wrote:I'm using Nagios Core 3.2.3 with roughly 850 servers and 7800 service checks. Most of the service checks are nrpe checks, but I have also the usual ping and host ping check and a ssh check.

For something that size, you might want to take a peek at the Large Installation Tweaks doco; I got some extra mileage out of "elder hardware" by leveraging some of the ideas therein.

rrkraft wrote:I'm running this on Red Hat Enterprise Linux 5, 64bit in a VM with 4 CPUs and 4 GB memory. The server per se is not max'd out. I always have 10 to 15 % CPU idle and memory is also fine. I'm not paging.

Is that 10 - 15% value on the physical hardware/hypervisor or the value extracted from the VM? Evaluating performance in the virtual world is not a trivial exercise, and there are many "gotchas".

rrkraft wrote:But I still see that it runs the ping checks (one of the most important ones) only every 15 to 20 minutes. With the 3 retries Nagios alerts only 45 minutes to an hour after the server goes down.

Are you using a "ping" test to detect the host going down, or are you using "ping" as a "service check" in addition to the check for the host? Both? Do you have "use_aggressive_host_checking" set in your nagios.cfg file? In any event, it sounds like you have the "retry_interval" set too long if you're only getting alerted after 45 minutes to an hour.

On pings: I am known to fulminate about this, but ICMP ECHO is not a very useful protocol for detecting whether a server is up or down -- it's only really useful to see if a host is reachable. I've seen plenty of Linux systems happily sitting with a kernel panic message on the console and still have the NIC answering pings....

rrkraft · Post by **rrkraft** » Tue Sep 13, 2011 2:27 pm

crfriend wrote:
rrkraft wrote:I'm using Nagios Core 3.2.3 with roughly 850 servers and 7800 service checks. Most of the service checks are nrpe checks, but I have also the usual ping and host ping check and a ssh check.
For something that size, you might want to take a peek at the Large Installation Tweaks doco; I got some extra mileage out of "elder hardware" by leveraging some of the ideas therein.

Yes, I definitely followed that procedure. I also started to use a tmpfs for the Nagios spool directory. That may have been part of the issue. I found out that a ramdisk with ext2 is much, much faster than a tmpfs, though a tmpfs is much faster than a real disk. Switching to a ramdisk decreased my scheduling queue delay from 45 minutes to 6 minutes!!
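For anyone who wants to compare filesystems for the spool directory on their own box, here is a minimal sketch of a micro-benchmark. It only times creating and deleting many small files, which is an assumption about what dominates spool traffic, not a measurement of Nagios itself. The file names and sizes are arbitrary.

```python
import os
import tempfile
import time


def time_small_file_churn(directory, count=2000, payload=b"x" * 256):
    """Create and delete `count` small files in `directory`; return seconds taken."""
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"check{i}.tmp")
        with open(path, "wb") as handle:
            handle.write(payload)
        os.remove(path)
    return time.perf_counter() - start


# Demo against a throwaway directory; point `directory` at a tmpfs mount,
# an ext2 ramdisk, and a disk-backed path to compare them.
with tempfile.TemporaryDirectory() as scratch:
    elapsed = time_small_file_churn(scratch, count=500)
    print(f"500 create/delete cycles took {elapsed:.3f} s in {scratch}")
```

Run it a few times against each candidate mount point and compare the timings; a single run can be skewed by caching and other activity on the machine.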

crfriend wrote:
rrkraft wrote:I'm running this on Red Hat Enterprise Linux 5, 64bit in a VM with 4 CPUs and 4 GB memory. The server per se is not max'd out. I always have 10 to 15 % CPU idle and memory is also fine. I'm not paging.
Is that 10 - 15% value on the physical hardware/hypervisor or the value extracted from the VM? Evaluating performance in the virtual world is not a trivial exercise, and there are many "gotchas".

It's from inside the VM. We build our VMs on hosts with plenty of resources left. We don't run our hosts very hot and take a rather conservative approach. If I have 15% idle, that time is really idle.

crfriend wrote:
rrkraft wrote:But I still see that it runs the ping checks (one of the most important ones) only every 15 to 20 minutes. With the 3 retries Nagios alerts only 45 minutes to an hour after the server goes down.
Are you using a "ping" test to detect the host going down, or are you using "ping" as a "service check" in addition to the check for the host? Both? Do you have "use_aggressive_host_checking" set in your nagios.cfg file? In any event, it sounds like you have the "retry_interval" set too long if you're only getting alerted after 45 minutes to an hour.

I'm using host and service ping checks. I'm not using aggressive_host_checking. My retry interval is set to 1 minute. The issue is that the scheduling queue is delayed so much. Nagios just cannot handle so many things to schedule, which is very, very sad.

crfriend wrote:
On pings: I am known to fulminate about this, but ICMP ECHO is not a very useful protocol for detecting whether a server is up or down -- it's only really useful to see if a host is reachable. I've seen plenty of Linux systems happily sitting with a kernel panic message on the console and still have the NIC answering pings....

Yes, I know about the ping "gotchas". But it's a very easy and "cheap" test. Everything else is either more wrong or takes more CPU cycles.

crfriend · Post by **crfriend** » Wed Sep 14, 2011 5:53 pm

rrkraft wrote:I'm using host and service ping checks. I'm not using aggressive_host_checking. My retry interval is set to 1 minute. The issue is that the scheduling queue is delayed so much. Nagios just cannot handle so many things to schedule, which is very, very sad.

Offhand, I'd recommend scrapping the "ping" as a service check (unless you want the display of RTT in the UI) and retaining it on the host check.

Other questions:

1) Are you running a database via ndoutils? I've seen that play silly buggers with the scheduler if the database cannot keep up.
2) What do you have configured in your nagios.cfg file for "max_concurrent_checks"? It's possible you're bottlenecking on that value. (I disable it by setting it to 0; this increased the load average, but interestingly didn't do much to the CPU utilisation.)
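To see at a glance what the settings mentioned in this thread are set to, something like the following sketch could be used. It relies on nagios.cfg being plain `key=value` lines with `#` comments. The sample text and its `check_result_path` value are made up for the demo; in practice you would read your real nagios.cfg into `text`.

```python
# Directives discussed in this thread (all single-valued in nagios.cfg).
TUNING_KEYS = [
    "max_concurrent_checks",
    "use_aggressive_host_checking",
    "use_large_installation_tweaks",
    "check_result_path",
]


def parse_nagios_cfg(text):
    """Parse key=value lines, skipping blanks and # comments.

    Repeatable directives such as cfg_file keep only their last value here,
    which is fine for the single-valued keys this sketch reports on.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def tuning_report(settings):
    """Return (key, value or '(not set)') for each directive of interest."""
    return [(key, settings.get(key, "(not set)")) for key in TUNING_KEYS]


# Hypothetical sample; replace with open("/path/to/nagios.cfg").read().
sample = """\
# performance-related settings
max_concurrent_checks=0
use_aggressive_host_checking=0
check_result_path=/var/spool/nagios/checkresults
"""

for key, value in tuning_report(parse_nagios_cfg(sample)):
    print(f"{key} = {value}")
```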

I've run a Nagios instance on a fairly long-in-the-tooth SPARC system with over a thousand hosts and almost 6000 services (albeit with a lot of work put in to optimise the checks involved) and the latency, without the database (which I now rely on), was seldom more than 2 or 3 seconds. That system is likely quite a lot slower than what's available for iron now.

As an interesting aside, based on performance-data analysis of my instances, ICMP pings were taking more time than any other general class of checks, at least on SPARC Solaris.

rrkraft · Post by **rrkraft** » Wed Sep 14, 2011 8:44 pm

I used ndoutils but turned it off to take it out of the equation. I assumed it might be slowing things down.
I disabled the max concurrent check counter by setting it to 0.

Right now I'm doing what you did: Moving check intervals around to lower the total number of checks per hour. I have only 850 hosts but a lot of checks per host. When I started I had roughly 60000 checks per hour. Now I'm roughly at 38000 checks an hour.
My biggest performance booster was switching from tmpfs to ramdisk with ext2 for the Nagios spool directory. I would have never thought that tmpfs is so much slower than ramdisk.
I have the system lagging right now between 10 and 90 seconds, which is not too bad. But it also means that I run at the limit. Adding more hosts or a check per server will kill me again.
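One way to put numbers on that lag is to pull the `check_latency` values (in seconds) out of the status file that Nagios writes. This is a minimal sketch; the sample text below is invented, and the real status file location depends on the `status_file` setting in nagios.cfg.

```python
import re

LATENCY_LINE = re.compile(r"^[ \t]*check_latency=([0-9.]+)[ \t]*$", re.MULTILINE)


def latency_summary(status_text):
    """Return (count, average, maximum) of all check_latency values found."""
    values = [float(v) for v in LATENCY_LINE.findall(status_text)]
    if not values:
        return 0, 0.0, 0.0
    return len(values), sum(values) / len(values), max(values)


# Invented sample in the style of status.dat; read your real file instead.
sample_status = """\
hoststatus {
\thost_name=web01
\tcheck_latency=10.250
\t}
servicestatus {
\thost_name=web01
\tservice_description=SSH
\tcheck_latency=12.500
\t}
servicestatus {
\thost_name=web01
\tservice_description=Disk
\tcheck_latency=87.250
\t}
"""

count, average, worst = latency_summary(sample_status)
print(f"{count} checks, average latency {average:.1f} s, worst {worst:.1f} s")
```

Watching the worst value over time shows whether a configuration change actually moved the scheduler away from its limit.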

BTW: Since the scheduler runs in only one thread, the speed of one Core is relevant and not how many Cores the server has. I bet I could add 10 more cores to my server and wouldn't see any improvement. But if I moved to a server with fewer but faster cores, I would gain. So older machines are probably better than newer ones: they tend to have fewer but faster cores, while newer hardware tends to come with many cores that are often individually slower and gets its speed from the core count.

crfriend · Post by **crfriend** » Thu Sep 15, 2011 6:40 am

rrkraft wrote: Since the scheduler runs in only one thread, the speed of one Core is relevant and not how many Cores the server has.

The way to tell if you're suffering from that would be to check if one of the cores is continually pinned, and if that's the case then the only way around the problem would be to split your monolithic setup into two (or more) instances and pass the results from those into another instance that just accepts passive checks.
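On Linux, one way to check for a pinned core is to compare two snapshots of /proc/stat and work out the busy percentage per core. The sketch below uses two invented snapshots so it runs anywhere; on a real system you would read /proc/stat twice, a few seconds apart, and pass both texts in.

```python
def parse_proc_stat(text):
    """Return {core_name: (busy_ticks, total_ticks)} for the per-core cpuN lines."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip non-CPU lines and the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Use user..steal only; guest time is already counted in user.
        ticks = [int(value) for value in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        total = sum(ticks)
        cores[fields[0]] = (total - idle, total)
    return cores


def busy_percent(before, after):
    """Per-core busy percentage between two parse_proc_stat() results."""
    usage = {}
    for name, (busy_after, total_after) in after.items():
        if name not in before:
            continue
        busy_delta = busy_after - before[name][0]
        total_delta = total_after - before[name][1]
        usage[name] = 100.0 * busy_delta / total_delta if total_delta else 0.0
    return usage


# Invented snapshots: cpu0 is almost fully busy, cpu1 is almost idle.
snapshot_1 = "cpu  200 0 100 1700 0 0 0 0\ncpu0 100 0 50 850 0 0 0 0\ncpu1 100 0 50 850 0 0 0 0\n"
snapshot_2 = "cpu  700 0 115 2185 0 0 0 0\ncpu0 580 0 60 860 0 0 0 0\ncpu1 120 0 55 1325 0 0 0 0\n"

usage = busy_percent(parse_proc_stat(snapshot_1), parse_proc_stat(snapshot_2))
for core, percent in sorted(usage.items()):
    print(f"{core}: {percent:.0f}% busy")
```

If one core sits near 100% across repeated samples while the others stay mostly idle, that points to a single-threaded bottleneck rather than a general shortage of CPU.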

rrkraft · Post by **rrkraft** » Thu Sep 15, 2011 7:59 pm

crfriend wrote:
rrkraft wrote: Since the scheduler runs in only one thread, the speed of one Core is relevant and not how many Cores the server has.
The way to tell if you're suffering from that would be to check if one of the cores is continually pinned, and if that's the case then the only way around the problem would be to split your monolithic setup into two (or more) instances and pass the results from those into another instance that just accepts passive checks.

Yes, that is exactly what's happening. But wouldn't a passive check also write into that filesystem so that the scheduler can pick it up? I would just not schedule anything new.
I'm trying to avoid splitting this up. That would require another server to maintain. That's a real drawback compared with Hobbit or OpenView and will nix Nagios from the pool of potential replacements for Hobbit, which would be a shame. It's just such a good monitoring program. I hate to dump it due to something like that. Why has no one ever fixed this drawback? From what I'm reading, I'm not the only one running into this limitation. 850 servers and 30000 checks is not a lot for a large company. We have over 3000 servers. I cannot even think how many instances I would need to monitor this. Hobbit and OpenView do that just fine ....

rrkraft · Post by **rrkraft** » Mon Sep 19, 2011 10:51 am

I think you have a very good point about ping tests done as a separate service check. I have plenty of other service checks per host, so they would also discover that the server is down. That would then also trigger the host check, and I get my host-down alert that way. I would need the ping service check only if I had no other service check for a server.

So I disabled the ping service check now. That removes a lot of checks per hour in the scheduling queue.

crfriend · Post by **crfriend** » Tue Sep 20, 2011 5:04 pm

rrkraft wrote:So I disabled the ping service check now. That removes a lot of checks per hour in the scheduling queue.

I do a lot of my development work on very modest hardware, and that forces me to get rather "creative" with optimisation. I know this sounds a lot like masochism, but if I can get a moderately large installation running nicely on 8 year old iron it'll run very nicely on modern kit.

Certainly removing "ping as a separate service" makes sense in this regard as the "ping" (usually as "check_host_alive" or somesuch) is frequently implicit in the host check, and if one is looking for RTT one can look at the host-check output.

Other opportunities may lurk in combining multiple checks into one, especially if the scheduler is saturated, or by transitioning some checks into passive checks and scheduling those via "cron" or some other mechanism. This can be especially useful if you're generating graphs with plugin output and the main scheduler cannot keep up. For instance, I wrote a custom mechanism to check the overall "health" of *NIX hosts based on SNMP queries (to the NET-SNMP agents running thereon) for a number of load, memory, and CPU parameters and then passed the weighted results into Nagios and stashed a "state file" which was separately used to populate RRDs offline from Nagios. This replaced what might have been 8 or 10 separate checks that the Nagios scheduler would have to deal with (per host) by a single return value as a passive check, thereby completely relieving the scheduler of having to deal with it.
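A minimal sketch of the passive-check side of that idea follows. It is not the custom mechanism described above: the weighting is replaced by a plain "worst state wins" rule, and the host name, service name, thresholds, and readings are all hypothetical. The one fixed piece is the external command line format, `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output`, which is written to the Nagios command file.

```python
import time

OK, WARNING, CRITICAL = 0, 1, 2


def combined_state(readings):
    """Collapse (value, warn_threshold, crit_threshold) readings into the worst state."""
    worst = OK
    for value, warn, crit in readings:
        if value >= crit:
            worst = max(worst, CRITICAL)
        elif value >= warn:
            worst = max(worst, WARNING)
    return worst


def format_passive_result(host, service, state, output, now=None):
    """Build one external-command line for a passive service check result."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{output}\n"


def submit(line, command_file):
    """Write the line to the Nagios command file (path depends on your install)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Hypothetical readings gathered elsewhere (e.g. by a cron job): load, memory %.
readings = [(1.2, 4.0, 8.0), (91.0, 90.0, 97.0)]
state = combined_state(readings)
line = format_passive_result("web01", "unix-health", state, "load=1.2 mem=91%", now=1316000000)
print(line, end="")
```

Run from cron, a script like this replaces several active checks per host with one passive result. `submit()` is not called in the demo because the command file only exists on a Nagios server with external commands enabled.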

Too, lots of people tend to throw massive numbers of checks at hosts and services that may not provide the level of insight intended. I am not intimating that your setup is done so, but occasionally revisiting individual checks for efficacy and relevance can yield interesting benefits. Minimalism is sometimes your friend.

Nagios Support Forum
