
ndo2db

Posted: Thu May 11, 2017 2:45 pm
by chicjo01
On one of our Nagios servers we keep encountering the error below, and we need to restart the ndo2db service to get Nagios working correctly until the queue fills up again. When the queue is full the checks are still executed and processed, but everything is sluggish and it takes time for the database and web interface to update correctly. We have other servers with the exact same setup and they are not encountering this type of problem. What can be done to correct this?

Nagios XI Version: 5.4.2

Note:
We have offloaded the MySQL database to other servers
We are using mod_gearman to handle the service checks

Code: Select all

May 11 10:35:53 xxxxxxx ndo2db: Warning: queue send error, retrying...
May 11 10:35:54 xxxxxxx ndo2db: Message sent to queue.
May 11 10:35:54 xxxxxxx ndo2db: Warning: queue send error, retrying...
May 11 10:35:55 xxxxxxx ndo2db: Message sent to queue.

Code: Select all

cat /proc/sys/kernel/msgmnb 
131072000

cat /proc/sys/kernel/msgmax 
131072000

cat /proc/sys/kernel/msgmni 
32768

Code: Select all

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

Nagios Core 4.2.4
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 12-07-2016
License: GPL

Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
        Checked 25148 services.
        Checked 3017 hosts.
        Checked 9773 host groups.
        Checked 5297 service groups.
        Checked 237 contacts.
        Checked 9733 contact groups.
        Checked 156 commands.
        Checked 9 time periods.
        Checked 0 host escalations.
        Checked 0 service escalations.
Checking for circular paths...
        Checked 3017 hosts
        Checked 0 service dependencies
        Checked 0 host dependencies
        Checked 9 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check

Code: Select all

ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    
0xf2010002 0          nagios     600        131061760    127990      
0x3e010002 32769      nagios     600        131046400    127975      

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x0113fc7b 360448     root       600        1000       14                      
0x00011749 393217     nagios     600        4096       304                     

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     
0x00000000 524288     apache     600        1         
0x00000000 557057     apache     600        1         
0x00000000 491522     apache     600        1         
0x00000000 589827     apache     600        1         
0x00000000 622596     apache     600        1         
0x00000000 655365     apache     600        1         

Re: ndo2db

Posted: Thu May 11, 2017 2:50 pm
by dwhitfield
It seems odd that something with the exact same setup would not have this issue. Are they literally checking the same devices and using the same version of XI? If they are checking different devices, then seeing different behavior would make sense.

I think running through https://support.nagios.com/kb/article.php?id=139 will resolve the issue for you.
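For reference, that KB article boils down to raising the SysV message-queue kernel limits. A rough sketch of the kind of change it describes, as a /etc/sysctl.conf fragment (the values below are illustrative only; use the article's actual numbers):

Code: Select all

```shell
# Append to /etc/sysctl.conf -- illustrative values, not the KB's exact numbers:
#   kernel.msgmnb = 131072
#   kernel.msgmax = 131072
#   kernel.msgmni = 32768
# Apply without a reboot, then verify the kernel picked up the new limits:
sysctl -p
ipcs -l
```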

Re: ndo2db

Posted: Thu May 11, 2017 3:54 pm
by chicjo01
Each Nagios server has different servers and checks on it. I followed the instructions in the link provided and we are still getting the same error. I did notice that the number of messages being added is far more than what is getting pulled out. Which of the options controls how many messages can be processed?

Re: ndo2db

Posted: Thu May 11, 2017 4:19 pm
by dwhitfield
chicjo01 wrote:Which of the options controls how many messages can be processed?
Well, they all do to some extent. We have one customer that has /proc/sys/kernel/msgmnb & /proc/sys/kernel/msgmax at 10x the defaults rather than the 2x we suggest in the document. Is ipcs showing the new limits? If not, you may need to reboot. If so, I'd just go with the 10x. If 10x works, you can step it down if you are concerned about it being too high.

Also, did you increase the /proc/sys/kernel/msgmni ?

Re: ndo2db

Posted: Thu May 11, 2017 4:25 pm
by chicjo01
I will try increasing it to 10x and see how it handles. I have also increased msgmni, based on the highest recommendation from the website provided. Will try it tomorrow and will let you know.

Re: ndo2db

Posted: Thu May 11, 2017 4:28 pm
by dwhitfield
Ok, great. Just a heads up that support ends at 2pm US CT on Fridays, so just be sure to get back to us before then if anything else needs to be looked at.

If that resolves the issue, have a great weekend!

Re: ndo2db

Posted: Fri May 12, 2017 9:05 am
by chicjo01
I have temporarily upped msgmax, msgmnb, and msgmni by 10x. The underlying problem is still there; increasing the max queue size only delays the outcome. The problem is that I have too many messages getting put into the queue versus messages getting processed. Isn't the point of mod_gearman to take messages from the queue and process them, then pass the results back to the queue to be processed by Nagios? I have 300 mod_gearman workers; do I need to increase this amount?

Code: Select all

cat /proc/sys/kernel/msgmax 
1073741274

cat /proc/sys/kernel/msgmnb
1073741274

cat /proc/sys/kernel/msgmni
1240000
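For anyone following along, queue fill can be estimated from the ipcs output. A minimal sketch using the used-bytes column against the msgmnb limit; the limit and sample rows below are taken from earlier in this thread (on a live box you would pipe `ipcs -q` in instead of the here-doc):

Code: Select all

```shell
#!/bin/sh
# Estimate how full each message queue is: used-bytes ($5) vs. the
# msgmnb limit. Sample data copied from the ipcs output in this thread.
MSGMNB=131072000

awk -v limit="$MSGMNB" '
    # ipcs -q data lines start with the 0x key
    /^0x/ { printf "%s %.0f%%\n", $1, ($5 / limit) * 100 }
' <<'EOF'
0xf2010002 0          nagios     600        131061760    127990
0x3e010002 32769      nagios     600        131046400    127975
EOF
```

Both queues above come out at 100%, which matches the symptom of the queue being full.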

Re: ndo2db

Posted: Fri May 12, 2017 9:09 am
by dwhitfield
Is mod_gearman running locally or remotely? There are certainly some other things you can try to increase performance: https://assets.nagios.com/downloads/nag ... ios-XI.pdf

It's hard to say whether anything in that document will really help this particular issue. To help in that regard, can you PM me your profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.

UPDATE: profile received and shared with techs

Re: ndo2db

Posted: Fri May 12, 2017 9:38 am
by chicjo01
I sent you the "System Info" in a PM. I tried to upload the profile, but it is 1,111 KB, which apparently is too big for your PM system.

Re: ndo2db

Posted: Fri May 12, 2017 11:24 am
by dwhitfield
You have a massive number of hosts and services, so I think that's ultimately the issue. You could increase the check interval on any that are not critical infrastructure. You could also move as many as possible to passive checks. Although I didn't go through all of the templates, it looks like you might have just one passive service.

Your service_check_timeout (and host_check_timeout) default is set to 180, which might be a little high. If you have checks that are taking that long, that could be a sign of other issues.

Your perfdata is timing out a lot, so I would definitely suggest a ramdisk for that: https://assets.nagios.com/downloads/nag ... giosXI.pdf
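A minimal sketch of such a ramdisk as an /etc/fstab fragment; the mount point, size, and spool paths below are placeholders, so follow the linked document for XI's actual settings and the config changes that go with it:

Code: Select all

```shell
# Illustrative /etc/fstab entry for a perfdata ramdisk (placeholder values):
#   tmpfs  /var/nagiosramdisk  tmpfs  rw,size=300m  0 0
# Then mount it and repoint the perfdata spool directories at the ramdisk:
mount -a
df -h /var/nagiosramdisk
```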

The profile wasn't able to pull your nagios.log, so I don't have a complete picture of things. If the ramdisk and other suggestions do not clear things up for you, can you send the output of tail -100 /usr/local/nagios/var/nagios.log

Also, I'm curious if 172.24.225.63 is your mod_gearman server.