I/O wait reported from secondary NagiosXI

jpipitone
Posts: 102
Joined: Tue Oct 12, 2010 1:21 pm

Post by jpipitone »

We have a secondary NagiosXI instance in one of our east coast offices. This NagiosXI instance monitors our other NagiosXI instance on the west coast.

We have noticed that the west coast NagiosXI (the one with the I/O wait) has been reporting various sites and services as critical and/or flapping, when in fact everything is up, with no issue.

Are there any tweaks that we can make to improve performance, and cut down on the i/o waits?

Currently, the secondary NagiosXI is reporting the following for our primary NagiosXI instance:

Critical: I/O Wait = 75.15%
Load Critical: load1=10.92, load5=24.86, load15=22.92
NagiosXI Jobs: Error: Could not parse XML from http://nagiosserver/nagiosxi ()

Any help would be appreciated.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: I/O wait reported from secondary NagiosXI

Post by abrist »

If the wait gets bad enough, checks will start to time out. That is my guess as to why you are getting inconsistent false alerts.
1) What type of disk subsystem is the server using?
2) How many cores/threads, and how much RAM?
3) How many checks are you running/scheduling every 5 minutes?
jpipitone wrote: NagiosXI Jobs: Error: Could not parse XML from http://nagiosserver/nagiosxi ()
This could be an unrelated issue, but we will have to get the resource usage under control before we can test this problem.
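To confirm the I/O wait figure directly on the server rather than through the remote check, something like the following could be run. It is only a minimal sketch, assuming a Linux host with a readable /proc/stat; the function name iowait_pct and the INTERVAL variable are made up for illustration and are not part of NagiosXI.

```shell
#!/bin/sh
# Sketch: estimate CPU I/O wait over a short interval from /proc/stat (Linux only).
# The aggregate "cpu" line holds cumulative ticks in this order:
#   user nice system idle iowait irq softirq steal ...

# iowait_pct BEFORE AFTER
# Takes two "cpu" lines and prints iowait as a percentage of all ticks
# that elapsed between them (hypothetical helper, not a Nagios tool).
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9 && i <= NF; i++) total1 += $i; wait1 = $6 }
        NR == 2 { for (i = 2; i <= 9 && i <= NF; i++) total2 += $i; wait2 = $6
                  d = total2 - total1
                  if (d <= 0) { print "0.00"; exit }
                  printf "%.2f\n", 100 * (wait2 - wait1) / d }'
}

# Take two samples INTERVAL seconds apart (default 5) and report the result.
if [ -r /proc/stat ]; then
    before=$(grep '^cpu ' /proc/stat)
    sleep "${INTERVAL:-5}"
    after=$(grep '^cpu ' /proc/stat)
    echo "iowait over ${INTERVAL:-5}s: $(iowait_pct "$before" "$after")%"
fi
```

A value that stays close to the 75.15% reported above across several runs would suggest the disks really are the bottleneck, rather than the remote check misreporting.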
Former Nagios employee
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: I/O wait reported from secondary NagiosXI

Post by tmcdonald »

Our two big optimizations are implementing a ramdisk and offloading the DB:

http://assets.nagios.com/downloads/nagi ... giosXI.pdf
http://assets.nagios.com/downloads/nagi ... Server.pdf

Both will lower the IO, and the DB offloading also helps drop the CPU load a bit.
Former Nagios employee
jpipitone
Posts: 102
Joined: Tue Oct 12, 2010 1:21 pm

Re: I/O wait reported from secondary NagiosXI

Post by jpipitone »

Specs:

1 quad-core Intel Xeon, 2.4 GHz
4 GB physical memory
3 x 7200 rpm disks (RAID 1, 1 hot spare)

We have about 1022 service checks. We perform checks every minute. The checks are for various switches, servers, websites, a few DNS queries, etc.

This installation has been running fine for years, and then when we started replacing legacy Nagios Core checks with NagiosXI checks, we noticed the I/O issues.

I will start with configuring a RAM disk. With only 249 MB free of 4096 MB of physical memory, this may be a challenge.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: I/O wait reported from secondary NagiosXI

Post by slansing »

I would highly recommend bumping your memory up a bit, 249m is likely not a safe number for processing check results and performance data through a ramdisk on an installation with roughly 2000 host/service checks combined. You can give it a shot, but my recommendation stands :).
jpipitone
Posts: 102
Joined: Tue Oct 12, 2010 1:21 pm

Re: I/O wait reported from secondary NagiosXI

Post by jpipitone »

slansing wrote:I would highly recommend bumping your memory up a bit, 249m is likely not a safe number for processing check results and performance data through a ramdisk on an installation with roughly 2000 host/service checks combined. You can give it a shot, but my recommendation stands :).
Thanks. Any recommendation on how large to make the RAM disk, given that we have 249 MB free at this time? It looks like the PDF recommends 50 MB. Also, this is a 32-bit operating system. If we added more physical memory, would NagiosXI even be able to use it on a 32-bit OS?

The files add up to about 40 MB. I am going to start with 100 MB and see how it runs, unless you object.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: I/O wait reported from secondary NagiosXI

Post by scottwilkerson »

Actually, the PDF makes the following recommendation:
Now we have to actually mount a RAM disk to that location. This is the point where we need to determine the size of the RAM disk that will be set aside. This can be determined by taking a look at your current status.dat and objects.cache which default to being in the /usr/local/nagios/var/ directory. On my test machine, these two files added up to be around 13MB, so I will set up the RAM disk to be 50MB to give some leeway and allow for growth. This will only make an improvement if you have enough available memory, otherwise this will mount the RAM disk and use swap memory for excess RAM allocated.
I would say at a minimum 4X the size of those files...
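The 4X rule of thumb can be worked out with a short script. This is only a rough sketch: the NAGIOS_VAR default is the path quoted from the PDF, the helper name ramdisk_mb is hypothetical, and the result is a starting point rather than a guarantee that the RAM disk will fit in free memory.

```shell
#!/bin/sh
# Sketch: suggest a RAM disk size as 4x the combined size of status.dat
# and objects.cache. NAGIOS_VAR defaults to the directory quoted from the
# PDF; adjust it if your install differs.
NAGIOS_VAR="${NAGIOS_VAR:-/usr/local/nagios/var}"

# ramdisk_mb FILE...
# Prints 4x the combined size of the files that exist, in whole MB,
# rounded up (hypothetical helper, not shipped with Nagios).
ramdisk_mb() {
    total=0
    for f in "$@"; do
        [ -f "$f" ] || continue          # skip files that are missing
        total=$((total + $(wc -c < "$f")))
    done
    echo $(( (total * 4 + 1048575) / 1048576 ))
}

size=$(ramdisk_mb "$NAGIOS_VAR/status.dat" "$NAGIOS_VAR/objects.cache")
echo "Suggested RAM disk size: ${size}M"
```

For the roughly 40 MB of files mentioned earlier in the thread, this would suggest about 160M, which is worth comparing against the free column of free -m before mounting anything.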
Former Nagios employee
jpipitone
Posts: 102
Joined: Tue Oct 12, 2010 1:21 pm

Re: I/O wait reported from secondary NagiosXI

Post by jpipitone »

OK. After creating the RAM disk, I'm seeing fewer I/O alerts; however, NagiosXI is running super slow. The interface is basically non-responsive.

I can click on All Service Problems or All Host Problems, and they eventually display. If I click on Service Details, the page doesn't display. I don't even get a timeout.

If we added more physical memory to this server, would NagiosXI be able to take advantage of more than 4 GB, considering the OS is only 32-bit?
jpipitone
Posts: 102
Joined: Tue Oct 12, 2010 1:21 pm

Re: I/O wait reported from secondary NagiosXI

Post by jpipitone »

Just an update: this morning it seems to be running OK. I ran the database repair script. It is still lagging quite a bit, but at least I can work in Nagios now.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: I/O wait reported from secondary NagiosXI

Post by slansing »

Lagging you say? What is the output of the following:

free -m

df -h

top
I'm wondering if something went wrong with the ramdisk creation and you are being strained for memory right now.
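Since top is interactive by default, one way to gather all three outputs in a single paste is sketched below. It assumes the procps version of top found on most Linux distributions, where -b selects batch mode and -n 1 prints a single snapshot; the snapshot function name is just for illustration.

```shell
#!/bin/sh
# Sketch: collect memory, disk, and process information in one pass so it
# can be pasted into a reply. Assumes Linux with procps (free, top).

snapshot() {
    echo "== free -m =="
    free -m                      # memory and swap usage in MB
    echo "== df -h =="
    df -h                        # mounted filesystems, including any tmpfs RAM disk
    echo "== top (first 20 lines) =="
    top -b -n 1 | head -n 20     # one batch-mode snapshot instead of the live view
}

snapshot
```

In the df -h section, the RAM disk should appear as a tmpfs mount with the size that was requested; if it is missing, the mount most likely did not succeed.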