ndo2db - not processing messages fast enough

Post by **delfin** » Wed Apr 06, 2016 10:08 pm

We have configured our 9 Nagios servers to send data to a MySQL server using NDOUtils. Everything is working well so far, in that we are able to collect data from all 9 Nagios servers. The issue is that 4 of the 9 Nagios servers seem to be sending more data than the ndo2db application can handle. As a result, the IPC message queue fills up to its limit (kernel.msgmnb = 524288000), causing the Nagios daemon to stall and ultimately stop on those 4 servers.

We have tried NDOUtils v2.0.0 and v2.1b2, but we get the same result.
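One way to watch how close a queue is to the kernel.msgmnb limit is to parse the output of `ipcs -q`. The sketch below assumes the usual util-linux column layout (key, msqid, owner, perms, used-bytes, messages); the sample row is made up for illustration and is not taken from these servers.

```python
import re

def parse_ipcs_queues(text):
    """Parse `ipcs -q` style output into a list of dicts, one per queue."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a hex key such as 0x4e010002.
        if len(fields) >= 6 and re.fullmatch(r"0x[0-9a-fA-F]+", fields[0]):
            queues.append({
                "msqid": int(fields[1]),
                "owner": fields[2],
                "used_bytes": int(fields[4]),
                "messages": int(fields[5]),
            })
    return queues

def queue_fill_ratio(used_bytes, msgmnb):
    """Fraction of the per-queue byte limit (kernel.msgmnb) in use."""
    return used_bytes / msgmnb

# Made-up sample output; on a real host, capture `ipcs -q` instead.
SAMPLE = """\
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x4e010002 0          nagios     600        262144000    125000
"""

if __name__ == "__main__":
    msgmnb = 524288000  # the limit quoted in the post
    for q in parse_ipcs_queues(SAMPLE):
        ratio = queue_fill_ratio(q["used_bytes"], msgmnb)
        print(f"queue {q['msqid']} ({q['owner']}): {ratio:.0%} full, "
              f"{q['messages']} messages")
```

Sampling this every few seconds would show whether the queue grows steadily (ndo2db cannot keep up) or only spikes briefly.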

https://github.com/NagiosEnterprises/nd ... tils-2.1b2

Any help or suggestions to get around this issue would be much appreciated. Thanks guys

Below is our setup.

9 nagios servers running the following:

Red Hat : 6.7
Nagios : 4.1.1
Ndomod : 2.1b2

Code:

#####################################################################
# NDOMOD CONFIG FILE
#
# Last Modified: 09-05-2007
#####################################################################
instance_name=MED-2
output_type=tcpsocket
output=<MysqlIPaddress>
tcp_port=5668
use_ssl=0
output_buffer_items=10000
buffer_file=/usr/local/nagios/var/ndomod.tmp
file_rotation_interval=14400
#file_rotation_command=rotate_ndo_log
file_rotation_timeout=60
reconnect_interval=15
reconnect_warning_interval=15
#reconnect_warning_interval=900
acknowledgement_data=1
adaptive_contact_data=0
adaptive_host_data=0
adaptive_program_data=0
adaptive_service_data=0
aggregated_status_data=0
comment_data=1
contact_status_data=0
downtime_data=1
event_handler_data=0
external_command_data=0
flapping_data=0
host_check_data=1
host_status_data=1
log_data=1
main_config_data=0
notification_data=0
object_config_data=0
process_data=0
program_status_data=1
retention_data=0
service_check_data=1
service_status_data=1
statechange_data=1
system_command_data=0
timed_event_data=0
config_output_options=0
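To see at a glance which data types a config like the one above forwards to ndo2db, the `key=value` lines can be parsed and filtered. This is a minimal sketch, assuming every `*_data` option is a simple 0/1 switch as it appears here; the excerpt in the code is a shortened copy of the posted file.

```python
def parse_kv_config(text):
    """Parse key=value lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings

def enabled_data_options(settings):
    """Return the sorted names of *_data options set to 1."""
    return sorted(k for k, v in settings.items()
                  if k.endswith("_data") and v == "1")

# Shortened excerpt of the ndomod.cfg posted above.
EXCERPT = """
acknowledgement_data=1
adaptive_host_data=0
host_check_data=1
#file_rotation_command=rotate_ndo_log
service_check_data=1
timed_event_data=0
"""

if __name__ == "__main__":
    for name in enabled_data_options(parse_kv_config(EXCERPT)):
        print(name)
```

Run against the full file, this would list the ten options currently set to 1, which are the candidates to review when trying to reduce the volume sent to the database.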

1 MySQL server running the following:

Red Hat : 6.7
MySQL : 5.1.3
Ndo2db : 2.1b2

Code:

#####################################################################
# NDO2DB DAEMON CONFIG FILE
#
# Last Modified: 01-02-2009
#####################################################################
lock_file=/usr/local/nagios/var/ndo2db.lock
ndo2db_user=nagios
ndo2db_group=nagios
#socket_type=unix
socket_type=tcp
socket_name=/usr/local/nagios/var/rw/ndo.sock
tcp_port=5668
use_ssl=0
db_servertype=mysql
db_host=localhost
db_port=5668
db_name=nagios
db_prefix=nagios_
db_user=<ndo2db_user>
db_pass=<nagiossecret>
max_timedevents_age=1440
max_systemcommands_age=10080
max_servicechecks_age=10080
max_hostchecks_age=10080
max_eventhandlers_age=44640
max_externalcommands_age=44640
max_notifications_age=44640
max_contactnotifications=44640
max_contactnotificationmethods=44640
max_logentries_age=129600
max_acknowledgements_age=44640
debug_level=-1
debug_verbosity=2
debug_file=/usr/local/nagios/var/ndo2db.debug
max_debug_file_size=1000000
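The `max_*_age` values are easier to reason about in days. The conversion below assumes they are expressed in minutes, as the comments in the stock ndo2db.cfg describe; only a few of the posted values are shown.

```python
MINUTES_PER_DAY = 24 * 60

def minutes_to_days(minutes):
    """Convert a retention age in minutes to days."""
    return minutes / MINUTES_PER_DAY

# A few of the retention settings from the config above.
retention = {
    "max_timedevents_age": 1440,
    "max_servicechecks_age": 10080,
    "max_notifications_age": 44640,
    "max_logentries_age": 129600,
}

if __name__ == "__main__":
    for name, minutes in retention.items():
        print(f"{name}: {minutes_to_days(minutes):g} days")
```

Shorter retention means smaller tables for ndo2db to write into and trim, which may matter on a database receiving data from nine instances.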

Post by **tmcdonald** » Thu Apr 07, 2016 10:55 am

Moving to NDO 2.1b2 was the right first step, as there were some fixes in there specifically to handle the slow processing.

How many hosts+services are being handled by each of these servers? It honestly might just be too much for the hardware. For that matter, what's the hardware look like on the servers?

Post by **delfin** » Thu Apr 07, 2016 10:15 pm

All the servers are virtual machines running on VMware ESX host(s) with the following specs:

Server: Proliant BL660c Gen8
Processor: Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
Total CPU : 4
CPU Cores: 32
Memory: 512 GB

Here are the Nagios and MySQL server VM hardware allocations and host/service check count:

5 Stable Servers

Server: NAG01
CPU Core: 8
Memory: 8 GB
Hosts: 4646
Service Checks: 18200

Server: NAG02
CPU Core: 8
Memory: 8 GB
Hosts: 4291
Service Checks: 16606

Server: NAG03
CPU Core: 8
Memory: 8 GB
Hosts: 5114
Service Checks: 17874

Server: NAG04
CPU Core: 8
Memory: 8 GB
Hosts: 8511
Service Checks: 17224

Server: NAG06
CPU Core: 8
Memory: 8 GB
Hosts: 76
Service Checks: 196

4 Problem Servers

Server: NAG05
CPU Core: 8
Memory: 8 GB
Hosts: 14007
Service Checks: 25276

Server: NAG07
CPU Core: 8
Memory: 16 GB
Hosts: 6136
Service Checks: 24537

Server: NAG08
CPU Core: 8
Memory: 16 GB
Hosts: 6512
Service Checks: 30508

Server: NAG09
CPU Core: 8
Memory: 16 GB
Hosts: 8488
Service Checks: 30715

1 MySQL Server

Server: MYSQL01
CPU Core: 8
Memory: 16 GB

Do we need to think about offloading some of the hosts/service checks from the 4 servers? Or can you think of other options?

Thanks.

Post by **bphl** » Fri Apr 08, 2016 12:10 am

The 9 Nagios servers are handling ~58000 hosts and ~181000 services, divided as shown below.

All servers have 8 CPUs and 16 GB of memory and run as VM guests.

Code:

server	hosts	services
nag01	4640	18200
nag02	4184	16606
nag03	5092	17874
nag04	8505	17224
nag05	14005	25276
nag06	76	196
nag07	6122	24537
nag08	6474	30508
nag09	8470	30715

The Nagios servers in question are nag05, nag07, nag08 and nag09.

Post by **tmcdonald** » Fri Apr 08, 2016 9:11 am

delfin wrote:Do we need to think about offloading some of the hosts/service checks from the 4 servers? Or can you think of other options?

In split setups like this I always recommend keeping the load as evenly distributed as possible. Looking at your numbers, the servers are pretty uniform in resources. With that in mind, the breaking point is likely around the 20k service mark. If you can shift some of the load onto your barely-used NAG06, that could alleviate some of the stress on 5/7/8/9.

Other options would include spinning up another server, *possibly* increasing the kernel queue limits (though that might just mask/delay the problem), working on adjusting the check/retry intervals (this can actually have a big impact if done properly and thoroughly), and tweaking various performance-related options in nagios.cfg.
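To illustrate why check intervals can have a large effect, here is a rough calculation of the average check rate for one server. The 5 and 10 minute intervals are assumptions for illustration only (the thread does not state the configured intervals), and the model ignores retries and host checks.

```python
def checks_per_second(num_checks, interval_minutes):
    """Average check rate if every check runs once per interval."""
    return num_checks / (interval_minutes * 60)

if __name__ == "__main__":
    services = 30715  # service count reported for NAG09
    for interval in (5, 10):  # assumed intervals, in minutes
        rate = checks_per_second(services, interval)
        print(f"{interval}-minute interval: {rate:.1f} checks/s")
```

Under this simplified model, doubling the interval halves the number of check results per second that ndomod has to forward, and with it the volume ndo2db has to write.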

Post by **delfin** » Mon Apr 11, 2016 3:03 am

If a server is monitoring 10k hosts and has 10k service checks, does that mean it has already hit the 20k breaking point? I'm thinking that the monitored hosts generate host checks, and since we are also gathering host check data, do we need to add the number of hosts to the number of service checks to work out whether we're reaching the 20k breaking point?

Post by **tmcdonald** » Mon Apr 11, 2016 9:30 am

Sorry, yes, you're correct. Adding the hosts and the services, the breaking point seems to be between 25k (highest of the stable servers) and 30k (lowest of the problem servers). There are many other factors as well, such as check frequency, the percentage of objects in non-OK states, etc., that will cause more frequent checking, but looking at the numbers you provided there is some sort of barrier related to the total count.
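The combined totals can be reproduced from the host and service counts in bphl's table earlier in the thread. The split into stable and problem servers follows the posts above.

```python
# (hosts, services) per server, from bphl's table.
SERVERS = {
    "nag01": (4640, 18200),
    "nag02": (4184, 16606),
    "nag03": (5092, 17874),
    "nag04": (8505, 17224),
    "nag05": (14005, 25276),
    "nag06": (76, 196),
    "nag07": (6122, 24537),
    "nag08": (6474, 30508),
    "nag09": (8470, 30715),
}
PROBLEM = {"nag05", "nag07", "nag08", "nag09"}

# Hosts plus services per server.
totals = {name: hosts + services for name, (hosts, services) in SERVERS.items()}
highest_stable = max(t for n, t in totals.items() if n not in PROBLEM)
lowest_problem = min(t for n, t in totals.items() if n in PROBLEM)

if __name__ == "__main__":
    for name, total in sorted(totals.items()):
        status = "problem" if name in PROBLEM else "stable"
        print(f"{name}: {total} ({status})")
    print(f"highest stable: {highest_stable}, lowest problem: {lowest_problem}")
```

The highest stable total is nag04 and the lowest problem total is nag07, which brackets the threshold at roughly 25.7k to 30.7k combined objects.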

Post by **bphl** » Tue Apr 12, 2016 4:25 am

Hi, are there any documents or best practice guidelines for scaling NDO in large environments?

Post by **scottwilkerson** » Tue Apr 12, 2016 2:55 pm

bphl wrote: Hi, are there any documents or best practice guidelines for scaling NDO in large environments?

Not that I know of. First and foremost, move it off of your Nagios server. Secondly, give both your monitoring server and the MySQL server the fastest disks you can afford. If you need to scale extremely large, you should be thinking about a RAID array of SSDs.

Finally, you can also limit the information you send to the database via data_processing_options in ndomod.cfg:

# DATA PROCESSING OPTION
# This option determines what data the NDO NEB module will process.
# Do not mess with this option unless you know what you're doing!!!!
# Read the source code (include/ndbxtmod.h) to determine what values
# to use here. Values from source code should be OR'ed to get the
# value to use here. A value of -1 will cause all data to be processed.
# Read the source code (include/ndomod.h) and look for "NDOMOD_PROCESS_"
# to determine what values to use here. Values from source code should
# be OR'ed to get the value to use here. A value of -1 will cause all
# data to be processed.

See: https://github.com/NagiosEnterprises/nd ... e/ndomod.h for details
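As a sketch of what "OR'ed" means here: each data type is a bit flag, and the option value is the bitwise OR of the flags you want processed. The flag names and bit values below are hypothetical placeholders, not copied from ndomod.h; look up the real `NDOMOD_PROCESS_` definitions in the header before using any value.

```python
from functools import reduce
from operator import or_

# HYPOTHETICAL bit values for illustration only.
# Take the real ones from the NDOMOD_PROCESS_* defines in include/ndomod.h.
EXAMPLE_FLAGS = {
    "SERVICE_CHECK_DATA": 1 << 0,
    "HOST_CHECK_DATA": 1 << 1,
    "LOG_DATA": 1 << 2,
    "NOTIFICATION_DATA": 1 << 3,
}

def combine_flags(names, flags=EXAMPLE_FLAGS):
    """Bitwise-OR the selected flags into a single option value."""
    return reduce(or_, (flags[name] for name in names), 0)

if __name__ == "__main__":
    value = combine_flags(["SERVICE_CHECK_DATA", "HOST_CHECK_DATA"])
    print(f"data_processing_options={value}")
```

Because each flag occupies its own bit, OR-ing is equivalent to adding the distinct values, and any single data type can be switched off by leaving its flag out.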

Post by **Box293** » Tue Apr 12, 2016 9:27 pm

scottwilkerson wrote:Finally, you can also limit the information you send to the database via the data_processing_options in the ndomod.cfg

This KB article might also help:

https://support.nagios.com/kb/article.php?id=113
