Is ndomod.o needed if you use mod_gearman

lexisnexis
Posts: 27
Joined: Wed Dec 30, 2015 3:19 pm

Post by lexisnexis »

NagiosXI: 5.5.2
Gearmand: 1.1.12
Mod_Gearman_Worker: 3.0.9

I have been seeing the kernel message queue (mqueue) become a bottleneck for mod_gearman. When a worker orphans a monitor check, the check gets stuck in the mqueue and is never cleared out. This buildup fills the mqueue, and Nagios is then unable to accept any new requests. If I comment out the broker_module line for ndomod.o in nagios.cfg, I bypass the mqueue and the monitors appear to work without issue, but I am not 100% sure the database is being updated properly.

So, is ndomod.o still needed when using mod_gearman?

Code: Select all

/usr/local/nagios/etc/nagios.cfg

# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg

broker_module=/usr/local/lib/mod_gearman/mod_gearman_nagios4.o config=/usr/local/etc/mod_gearman/module.conf
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Is ndomod.o needed if you use mod_gearman

Post by tgriep »

Yes, the ndomod broker module is still required in the nagios.cfg file, even with Mod Gearman enabled.
Nagios uses the ndo module to write the status data (and more) to the MySQL database, which the XI server needs in order to run.
Have you checked the Gearman worker log file to see why it is orphaning checks?

Code: Select all

/var/log/mod_gearman/mod_gearman_worker.log
Be sure to check out our Knowledgebase for helpful articles and solutions!

Post by lexisnexis »

The logs show timeouts for the servicechecks, which is what I would expect for the monitor to be marked as an orphan. Is there a way to bypass the mqueue and use mod_gearman with ndoutils instead?

Code: Select all

ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    
0x03000002 1966080    nagios     600        524288000    512000      
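
The figures in the ipcs table above can be sanity-checked with a little arithmetic. This is only a minimal Python sketch: the function name parse_ipcs_queues is made up for illustration, the sample text is the table pasted above, and the msgmnb value is the one from the sysctl.conf posted in this thread.

```python
def parse_ipcs_queues(text):
    """Parse the 'Message Queues' table printed by ipcs into a list of dicts."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a hex key such as 0x03000002.
        if len(fields) == 6 and fields[0].startswith("0x"):
            queues.append({
                "key": fields[0],
                "msqid": int(fields[1]),
                "owner": fields[2],
                "perms": fields[3],
                "used_bytes": int(fields[4]),
                "messages": int(fields[5]),
            })
    return queues


SAMPLE = """\
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x03000002 1966080    nagios     600        524288000    512000
"""

MSGMNB = 524288000  # kernel.msgmnb from the sysctl.conf posted in this thread

for q in parse_ipcs_queues(SAMPLE):
    # Average message size and how much of the per-queue byte limit is in use.
    avg = q["used_bytes"] // q["messages"] if q["messages"] else 0
    pct = 100 * q["used_bytes"] / MSGMNB
    print(f"queue {q['msqid']}: {q['messages']} messages, "
          f"{avg} bytes each on average, {pct:.0f}% of msgmnb")
```

With the numbers above, the queue holds 512000 messages of 1024 bytes each on average, which is exactly the configured byte limit.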

mod_gearman_worker log right before and after the mqueue gets filled up.

Code: Select all

[2019-06-06 08:36:37][66758][INFO ] timeout (120s) hit for servicecheck: X.X.X.82 - Raid Status
[2019-06-06 08:37:52][46658][INFO ] timeout (120s) hit for servicecheck: X.X.X.104 - Raid Status
[2019-06-06 08:37:55][136475][INFO ] timeout (120s) hit for servicecheck: X.X.X.49 - Drive Space
[2019-06-06 08:38:57][43685][INFO ] no checks in 2minutes, restarting all workers
[2019-06-06 08:40:58][43685][INFO ] no checks in 2minutes, restarting all workers
...
...
...

Code: Select all

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256502
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 8192
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Code: Select all

cat /etc/sysctl.conf
# System default settings live in /usr/lib/sysctl.d/00-system.conf.
# To override those settings, enter new settings here, or in an /etc/sysctl.d/<name>.conf file
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
kernel.msgmnb = 524288000
kernel.msgmax = 524288000
kernal.msgmni = 1024000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.randomize_va_space = 2
net.ipv4.ip_forward = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.log_martians = 1
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_syncookies = 1
net.ipv6.conf.all.accept_ra = 0
net.ipv6.conf.default.accept_ra = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv6.conf.default.accept_redirects = 0
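
One detail in the file above may be worth a second look: the msgmni key is spelled kernal. rather than kernel. That may simply be a slip made when pasting. The Python sketch below flags keys whose top-level namespace looks unfamiliar. The function name and the KNOWN_PREFIXES set are assumptions made for illustration, and the set is not exhaustive.

```python
# Top-level sysctl namespaces commonly found under /proc/sys on Linux.
# Assumption: this set is illustrative and not exhaustive.
KNOWN_PREFIXES = {"abi", "crypto", "debug", "dev", "fs",
                  "kernel", "net", "sunrpc", "user", "vm"}


def suspicious_sysctl_keys(conf_text):
    """Return keys whose top-level namespace is not in KNOWN_PREFIXES."""
    flagged = []
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (sysctl.conf allows '#' and ';').
        if not line or line.startswith(("#", ";")):
            continue
        key = line.split("=", 1)[0].strip()
        if key.split(".", 1)[0] not in KNOWN_PREFIXES:
            flagged.append(key)
    return flagged


SAMPLE = """\
kernel.msgmnb = 524288000
kernel.msgmax = 524288000
kernal.msgmni = 1024000
kernel.shmmax = 4294967295
"""

print(suspicious_sysctl_keys(SAMPLE))
```

If the spelling is the same in the real file, that line would not be setting kernel.msgmni. Running sysctl kernel.msgmni shows the value actually in effect.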

Post by tgriep »

How many hosts and services is the nagios server running?

Did you increase the default timeout in the nagios.cfg file to 120 seconds?

Run this on the nagios server and post the output to the ticket.

Code: Select all

grep time /usr/local/nagios/etc/nagios.cfg
Also, check the Mod Gearman worker configuration file to see if the timeout is set to either 60 or 120 seconds.

Code: Select all

job_timeout=60
Mod Gearman and NDO are 2 separate applications and they do not interact with each other.
As far as I know, Mod Gearman does not use the Kernel Message Queue.

What I think could be happening is that the Nagios server is running a lot of checks at a 1 minute interval, and when they fail on the Gearman worker, they stay in memory for 120 seconds. The next check then runs a minute later and adds another check in memory, and so on until the memory / queue is filled up.

If you have a lot of frequently failing service checks that run every minute, change the interval to 2 or 3 minutes; that should help.
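
The overlap described above can be put into numbers. This is a simplified model written as a Python sketch: it assumes every run of a check hangs for the full timeout and that a new run is queued at each interval regardless. The function name overlapping_runs is made up for illustration.

```python
import math


def overlapping_runs(timeout_s, interval_s):
    """Upper bound on how many runs of one check can be pending at once
    if every run hangs until the timeout."""
    return math.ceil(timeout_s / interval_s)


# 120 second timeout against 1, 2 and 3 minute check intervals.
for interval_s in (60, 120, 180):
    print(f"interval {interval_s}s -> up to "
          f"{overlapping_runs(120, interval_s)} pending run(s) per check")
```

Under this model, a 1 minute interval with a 120 second timeout allows two pending runs per failing check, while an interval of 2 or 3 minutes allows only one.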


Usually, the Kernel Message Queue fills up because the ndo2db daemon cannot write the data to the MySQL database fast enough.
Check the MySQL database log files, as well as the /var/log/messages file, for any errors logged when the issue happens.
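
To illustrate why a slow consumer fills the queue, here is a rough Python sketch of the arithmetic. The rates are hypothetical and chosen only for illustration; the capacity is the message count that ipcs showed earlier in the thread when the queue was full. The function name seconds_until_full is made up.

```python
def seconds_until_full(capacity_msgs, in_per_s, out_per_s, backlog_msgs=0):
    """Time for a queue to fill when producers outpace the consumer.
    Returns None if the consumer keeps up."""
    net = in_per_s - out_per_s
    if net <= 0:
        return None
    return (capacity_msgs - backlog_msgs) / net


CAPACITY = 512000  # message count shown by ipcs when the queue was full

# Hypothetical rates, chosen only to illustrate the arithmetic.
print(seconds_until_full(CAPACITY, in_per_s=1000, out_per_s=800))  # 2560.0
print(seconds_until_full(CAPACITY, in_per_s=800, out_per_s=1000))  # None
```

In other words, even a modest shortfall on the database side fills the queue within an hour, and a larger queue only delays that point.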

Post by lexisnexis »

We increased the mod-gearman-worker timeout because our admins told us that a number of monitors run slowly and can take up to 120 seconds to return a successful response. So we increased the worker timeout to 180 seconds, but the worker only orphans the monitor check after the timeout. Most of the monitors on this Nagios server have an interval of 5 or 10 minutes; only a few run every 2 minutes.

I talked with our DBAs. They reviewed the MySQL servers that the Nagios server uses and reported that the system and database are healthy and everything is running fine.

Our mod-gearman-worker timeout:

Code: Select all

job_timeout=180

Code: Select all

Checking objects...
        Checked 53461 services.
        Checked 6334 hosts.
        Checked 93 host groups.
        Checked 13 service groups.
        Checked 19 contacts.
        Checked 14 contact groups.
        Checked 281 commands.
        Checked 19 time periods.
        Checked 0 host escalations.
        Checked 0 service escalations.
Checking for circular paths...
        Checked 6334 hosts
        Checked 0 service dependencies
        Checked 0 host dependencies
        Checked 19 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Code: Select all

grep time /usr/local/nagios/etc/nagios.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
event_handler_timeout=30
host_check_timeout=30
max_check_result_reaper_time=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
service_check_timeout=120

Post by tgriep »

Your system is running about 3 times the number of host and service checks at which we recommend moving to a physical server.
You may want to look into that if the server is running in a virtual environment.

If you have not already done so, try implementing more performance tips from this document.

Code: Select all

https://assets.nagios.com/downloads/nagiosxi/docs/Maximizing-Performance-In-Nagios-XI.pdf
Did you find any errors in the /var/log/messages file?

Post by lexisnexis »

I checked the logs in /var/log/messages and I found a few orphan messages, which are the same messages I found in the nagios.log. Is there something specific I should be looking for in the log file?

Post by tgriep »

The errors would probably be generated by the ndo2db process or the ndomod module.
Check whether they are logging errors, including data sink errors.

Post by lexisnexis »

I checked ndo2db.debug and did not see the keyword "error" anywhere in the file. I already applied the Maximizing-Performance-In-Nagios-XI recommendations when the server was set up.


Jun 6 13:14:22 ndo2db: Warning: queue send error, retrying...
Jun 6 13:14:42 ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may need to be tuned. See README.
Jun 6 13:14:42 ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 512000 of 32768 messages and 524288000 of 524288000 bytes in the queue. See README for kernel tuning options.
Jun 6 13:15:02 ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may need to be tuned. See README.
Jun 6 13:15:02 ndo2db: Warning: queue send error, retrying...
Jun 6 13:15:22 ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may need to be tuned. See README.
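
The usage figures in the warning above can be pulled out and compared programmatically. This is a minimal Python sketch: the regular expression is based only on the wording of the line quoted above, not on ndo2db documentation, and the function name queue_usage is made up for illustration.

```python
import re

# Pattern derived from the log line quoted in this post (an assumption,
# not a documented ndo2db format).
PATTERN = re.compile(
    r"using (\d+) of (\d+) messages and (\d+) of (\d+) bytes in the queue"
)


def queue_usage(log_line):
    """Extract the usage figures from an ndo2db 'Retrying message send' warning."""
    m = PATTERN.search(log_line)
    if m is None:
        return None
    msgs_used, msgs_limit, bytes_used, bytes_limit = map(int, m.groups())
    return {
        "msgs_used": msgs_used,
        "msgs_limit": msgs_limit,
        "bytes_used": bytes_used,
        "bytes_limit": bytes_limit,
        "bytes_pct": 100 * bytes_used / bytes_limit,
    }


LINE = ("Jun 6 13:14:42 ndo2db: Warning: Retrying message send. This can occur "
        "because you have too few messages allowed or too few total bytes allowed "
        "in message queues. You are currently using 512000 of 32768 messages and "
        "524288000 of 524288000 bytes in the queue. See README for kernel tuning options.")

print(queue_usage(LINE))
```

For the line above, the byte usage comes out at 100% of the limit, and the reported message count is well above the reported message limit.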

Post by tgriep »

Could you post your Nagios XI System Profile so we can review it?
To get your system profile, log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file and upload it to the forum post.

Is the MySQL server running on the Nagios server or on a remote server?

The ndo2db process cannot write the data to the MySQL server fast enough, which is why the queue is filling up.
We may need to look into optimizing the MySQL database.

Can you get the MySQL server config files and post them here?

Code: Select all

/etc/my.cnf
and all of the files from the

Code: Select all

/etc/my.cnf.d
folder if it exists.