Nagios not updating after connecting with ndoutils

Post by **binmats** » Fri Oct 07, 2011 12:21 am

Hi,

We have a central Nagios environment with 2 distributed servers on RHEL 6. Each distributed server monitors around 100 servers and 2000 services. We are using a server with a 16-core CPU and 32 GB of RAM as the hardware for this.

We are using ndoutils 1.47b (1.49 was giving some connection errors before) to connect to the MySQL DB. But when we connect to the DB, Nagios stops updating on the server. It's actually not stopping; it's working very, very slowly. That is, it will update only some servers. If I disconnect ndoutils, Nagios updates correctly.

I am not sure why this is happening. I think all was fine when we had a small number of servers to monitor, but when the number of servers increased, the insert rate also increased. This might be the issue. I think ndo inserts into MySQL in a single thread from each instance. Is there any way to spread the inserts across multiple threads?

I hope all my configuration is correct, as it was working fine before with a small number of servers.

Can anyone help me resolve this issue?

Thanks in advance.

Post by **crfriend** » Fri Oct 07, 2011 9:10 pm

binmats wrote: We have a central Nagios environment with 2 distributed servers on RHEL 6. Each distributed server monitors around 100 servers and 2000 services. We are using a server with a 16-core CPU and 32 GB of RAM as the hardware for this.

What sort of iron is your database running on? Big Nagios installations can be very demanding on the database engines, and performance does suffer when ndoutils is in use.

There are also some other questions, and rationale for them:

1) Are all your distributed servers connecting to the same database engine? (Is the database engine so busy it can't keep up?)
2) Are all the distributed servers connecting to the same database on the same engine? (There may be locking constraints that are delaying updates and causing the active monitors to fall behind.)
3) Do all the active-checking servers start falling behind equally? (If so, the DB engine is definitely the problem, for either of the reasons above.)

I run a hybrid setup at work with three servers, each running its own MySQL database engine, all with multiple Nagios instances performing active checks, and one running the "global view" instance which accepts, displays, and alerts based on data provided by the active instances. Basically, the two servers that do not house the "global" instance deal with their own databases only, and this keeps contention for said databases to a minimum. The server housing the "global" instance does get a lot of contention, mostly in locking, and I've had to cut back quite dramatically on the length of time that history is kept because I was seeing severe contention while old history was being purged. I also had to add indices to the schema to improve the speed of said deletions.

If I had to do this again, I suspect I'd use different databases for each Nagios instance rather than lumping them all into one large database and differentiating using the "nagios_instances" table and "instance_id" column. That might well help the MySQL process better deal with locking and whatnot so different instances would not be so prone to step on one another.

binmats wrote: We are using ndoutils 1.47b (1.49 was giving some connection errors before) to connect to the MySQL DB.

I fought with ndoutils 1.49b for quite a while, and I believe I finally managed to kick most of the bugs out (it segfaulted prodigiously on Solaris 10) as it runs well enough on my small environment at home, but reverted work to 1.47b mainly because it was easier (and the pressure was on).

It's also worth noting that when ndoutils is in use, restarting an instance (e.g. on configuration change) can cause serious delays before the scheduler actually starts running active checks. This can have side effects if one is using facilities like "nagiosgraph" to do trending.

Offhand, in your case, I would consider spreading the load across multiple MySQL instances and using different table prefixes for each Nagios instance that's running (e.g. "nagios1_", "nagios2_", &c.) rather than throwing them all into one big database -- this will likely be of high importance to the "rollup" instance that receives all the passive checks. Since the "rollup" instance will have *all* the hosts and services that all the active instances have, it can be queried to get up-to-the-moment data even if it's in its own database completely. Good luck with it.
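A rough sketch of what that split could look like, written as a small Python helper that renders one `ndo2db.cfg` fragment per Nagios instance, each pointing at its own database. The directive names (`socket_type`, `tcp_port`, `db_servertype`, `db_name`, `db_prefix`) are ordinary `ndo2db.cfg` settings; the instance names, database names, and port numbers are placeholders, and the sketch assumes one ndo2db daemon per instance.

```python
from dataclasses import dataclass


@dataclass
class NdoInstance:
    """One Nagios instance and the database its ndo2db daemon writes to."""
    name: str      # hypothetical instance label, for illustration only
    db_name: str   # separate MySQL database for this instance
    tcp_port: int  # port this instance's ndo2db daemon listens on


def render_ndo2db_fragment(inst: NdoInstance, db_prefix: str = "nagios_") -> str:
    """Return the database-related lines of an ndo2db.cfg for one instance."""
    lines = [
        f"# ndo2db.cfg fragment for instance '{inst.name}' (illustrative)",
        "socket_type=tcp",
        f"tcp_port={inst.tcp_port}",
        "db_servertype=mysql",
        f"db_name={inst.db_name}",
        f"db_prefix={db_prefix}",
    ]
    return "\n".join(lines)


# Placeholder values: two active instances, each with its own database.
instances = [
    NdoInstance(name="poller1", db_name="nagios", tcp_port=5668),
    NdoInstance(name="poller2", db_name="nagios1", tcp_port=5669),
]

fragments = {inst.name: render_ndo2db_fragment(inst) for inst in instances}

for text in fragments.values():
    print(text, end="\n\n")
```

Keeping the same table prefix and varying only `db_name` means existing queries against the schema keep working per database; varying `db_prefix` inside one database is the alternative mentioned above.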

Post by **jsmurphy** » Fri Oct 07, 2011 9:19 pm

Have you checked the nagios/var/nagios.log? Or the event log page? Do you see error messages with something along the lines of "Warning: The check of service 'xxxxx' on host 'xxxxxxx' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service..."

If you can see this error spammed in the logs, stop all the services (MySQL, Nagios, ndo2db, Apache), use ps -ef | grep nagios to ensure Nagios is truly stopped and kill it if not, then start them again in the order MySQL, NDOUtils, Nagios, Apache2. This usually makes the bad things go away; the problem tends to be the result of the server underperforming.
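To judge whether the warning is really being "spammed", something like the following sketch can count the orphaned-check messages per host and service. The regular expression is built from the message text quoted above; the sample log lines are made up for illustration, and the location of your real nagios.log depends on the install.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the orphaned-service warning quoted in the post above.
ORPHAN_RE = re.compile(
    r"The check of service '(?P<service>[^']*)' on host '(?P<host>[^']*)' "
    r"looks like it was orphaned"
)


def count_orphaned(lines: Iterable[str]) -> Counter:
    """Count orphaned-check warnings per (host, service) pair."""
    counts: Counter = Counter()
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("service"))] += 1
    return counts


# Made-up sample lines in nagios.log style, for illustration only.
sample_log = [
    "[1317990000] Warning: The check of service 'HTTP' on host 'web01' "
    "looks like it was orphaned (results never came back).  "
    "I'm scheduling an immediate check of the service...",
    "[1317990005] SERVICE ALERT: web01;HTTP;OK;HARD;1;HTTP OK",
    "[1317990060] Warning: The check of service 'HTTP' on host 'web01' "
    "looks like it was orphaned (results never came back).  "
    "I'm scheduling an immediate check of the service...",
]

orphans = count_orphaned(sample_log)
for (host, service), n in orphans.most_common():
    print(f"{host}/{service}: {n} orphaned checks")
```

Against a real log, pass an open file handle instead of the sample list, since `count_orphaned` accepts any iterable of lines.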

If however it's just sitting there and doing nothing with no errors, did you already have several thousand devices and turn the DB off for any lengthy period of time? If this is the case I have seen instances where it can take up to roughly 30~40 mins sitting there doing nothing to fill the database (or so I assumed) and then begin monitoring again.

Post by **SoulA** » Mon Oct 10, 2011 9:56 am

With the Nagios instance we run, the NDO database is hosted on its own separate server. During restarts, or whenever the database is running slowly, Nagios will also run slowly. This is because NDO is synchronous in nature, meaning Nagios waits for each MySQL transaction to be committed before moving on with its checks.
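A back-of-the-envelope sketch of why a synchronous writer falls behind as the number of checks grows. Only the figure of 2000 services comes from this thread; the check interval, the number of rows written per check, and the per-insert latencies are assumptions picked for illustration, so treat the output as an order-of-magnitude guide only.

```python
def events_per_second(services: int, check_interval_s: float, rows_per_check: int) -> float:
    """Average number of rows the writer must insert each second."""
    return services * rows_per_check / check_interval_s


def writer_utilization(rate: float, seconds_per_insert: float) -> float:
    """Fraction of each second a single synchronous writer spends inserting.

    A value of 1.0 or more means the writer cannot keep up, and whatever
    waits on it (here, the check scheduler) is held back.
    """
    return rate * seconds_per_insert


# Assumed workload: 2000 services, one check per minute, 3 rows per check.
rate = events_per_second(services=2000, check_interval_s=60.0, rows_per_check=3)

# Assumed per-insert latencies in milliseconds.
for latency_ms in (1.0, 5.0, 20.0):
    util = writer_utilization(rate, latency_ms / 1000.0)
    status = "keeps up" if util < 1.0 else "falls behind"
    print(f"{latency_ms:>5.1f} ms per insert -> utilization {util:.2f} ({status})")
```

The point of the arithmetic is that the database does not have to be "down" for Nagios to stall; a modest rise in commit latency is enough once the insert rate is high.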

We typically don't have too many problems with NDO 1.49, but the one thing we do notice is that it takes quite a bit of time for Nagios to restart. This is because NDO deletes entire tables and re-creates them. One thing we have done is limit the amount of data we send to NDO. We don't need all of the stuff that gets sent, so tuning that down may help. You can figure out how to do this in the ndomod.cfg file. This http://labs.consol.de/lang/de/nagios/nd ... g-options/ may also come in handy.
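As a sketch of that tuning: the broker module's config file takes a `data_processing_options` bitmask, where each bit enables one category of data. The bit values below are the ones I understand NDOUtils 1.x to use (treat them as an assumption and check the headers shipped with your version), and which categories you can safely drop depends on what reads your database.

```python
from enum import IntFlag


class NdoData(IntFlag):
    """A few data_processing_options bits (values assumed, see note above)."""
    TIMED_EVENT_DATA = 2
    SERVICE_CHECK_DATA = 64
    HOST_CHECK_DATA = 128


# Assumes 26 option bits in total, so "everything" is 2**26 - 1.
ALL_DATA = 2**26 - 1


def options_without(*excluded: NdoData) -> int:
    """Return the bitmask for all data categories except the excluded ones."""
    mask = ALL_DATA
    for flag in excluded:
        mask &= ~int(flag)
    return mask


reduced = options_without(
    NdoData.TIMED_EVENT_DATA,
    NdoData.SERVICE_CHECK_DATA,
    NdoData.HOST_CHECK_DATA,
)
print(f"data_processing_options={reduced}")
```

Under these assumptions the mask drops timed events and the per-check history rows while leaving the status categories enabled.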

Another thing we are planning to implement shortly is the asynchronous patch for NDO, which I believe you can find on Nagios Exchange under the patches directory. It uses IPC queues to hold the queries that still need to run. In our testing it makes restarts, and NDO in general, much faster, but you run the risk of losing data if your queue gets too large or something of that nature.
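The trade-off can be sketched with a bounded queue. This is not the patch itself, which uses IPC queues between processes; it is an in-process stand-in using Python's standard `queue` module, and the class and method names are made up for illustration. The producer never blocks on the database, and the price is that queries are dropped once the buffer is full.

```python
import queue
from typing import Callable, List


class BoundedWriteBuffer:
    """Buffer queries in memory so the producer never waits on the database."""

    def __init__(self, max_items: int) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=max_items)
        self.dropped = 0

    def submit(self, query: str) -> bool:
        """Enqueue without blocking; count and drop the query if the buffer is full."""
        try:
            self._queue.put_nowait(query)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, execute: Callable[[str], None]) -> int:
        """Run every buffered query through `execute`; return how many ran."""
        done = 0
        while True:
            try:
                query = self._queue.get_nowait()
            except queue.Empty:
                return done
            execute(query)
            done += 1


# The producer submits 5 queries while the (slow) writer has not run yet.
buffer = BoundedWriteBuffer(max_items=3)
accepted = [buffer.submit(f"INSERT check result {i}") for i in range(5)]

# Later the writer catches up; here "executing" just records the query.
written: List[str] = []
drained = buffer.drain(written.append)
print(f"accepted={sum(accepted)} dropped={buffer.dropped} written={drained}")
```

Sizing the buffer is the whole game: too small and data is lost during a slow restart, too large and memory use grows while the database lags.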

Post by **binmats** » Wed Oct 12, 2011 4:06 am

Hi All,

Thanks for the replies and suggestions. I was a little busy with work, so I couldn't reply to your posts in time.

We are using the InnoDB engine and are not seeing any locks on the DB. We were connecting to the same DB with the same table prefix, nagios_.
As crfriend suggested, we went that way with a small change: instead of creating two table prefixes, we use two databases.

That is, we created two databases called nagios and nagios1 and pointed each instance at its own.
Now checking and inserting look more accurate than before.

However, we found that the instance connected to nagios1 (the new DB) performs better than the other one. I think this is because the old DB holds more data. We are planning to archive data from it. Let's see how it goes.

Thanks again for all.

Nagios Support Forum
