XI and Core out of Sync

esendex · Post by **esendex** » Wed Apr 22, 2015 5:16 am

Hi,

We are currently experiencing problems with our Nagios XI installation.

Server Details
- System:
Nagios XI Version : 2014R2.6
NAGIOS-01.datacentre.esendex.com 2.6.32-504.12.2.el6.x86_64 x86_64
CentOS release 6.6 (Final)
Gnome is not installed
- VMWare Image
- No Special Configurations

There are discrepancies between the check times displayed in the Nagios XI interface and the check schedule shown in the Core interface.

In the XI interface the last check and next check times are well in the past. In fact they never seem to update (screenshots attached), even though the Core installation behind XI seems to be working and alerting correctly.

20150422-104414.png

20150422-104356.png

Other issues that I think are related: for example, acknowledging a problem in XI does not show up in XI, but it does appear to have taken effect in the Core status view.

So far I have made sure the time is consistent throughout the Nagios box (system, PHP, MySQL), run the repairmysql.sh script following the Nagios documentation (http://goo.gl/Pxnkga), and removed the retention.dat file and restarted to try to reschedule everything.

I also increased kernel.msgmax, as we seemed to be hitting that limit; I'm not sure why.

Code: Select all

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    
0x1f000002 0          nagios     600        902465536    881314

I'm a bit stuck with where to go now. Possibly database corruption. No idea.

Any help would be greatly appreciated.

Cheers,

Andy

EDIT:
Also, this section of the Nagios XI monitoring engine status seems to be empty, and XI occasionally doesn't recognise that the "Monitoring Engine Process" is running.

20150422-111955.png

esendex · Post by **esendex** » Wed Apr 22, 2015 11:50 am

Curiously, we 'nuked' the kernel message queue using ipcrm -Q earlier because it had filled up, which we saw through

Code: Select all

Apr 22 15:39:18 NAGIOS-01 ndo2db: Message sent to queue.
Apr 22 15:39:18 NAGIOS-01 ndo2db: Warning: queue send error, retrying...

Following this, Nagios XI briefly showed a more accurate 'Last Check' time. We've now ended up with three kernel message queues owned by 'nagios', and they are filling up again.

So, in terms of the processes reading off that queue, where should we look for bottlenecks or signs that it isn't working as optimally as it should?

Many thanks in advance

rseiwert · Post by **rseiwert** » Wed Apr 22, 2015 2:02 pm

I have experienced a lot of this lately myself. It looks like your ndo2db is being choked. I would look for checks returning huge results. You also cannot trust those green check boxes. To see when this was happening I added a check freshness indicator, which at least lets me know when a scheduled check is missed: http://support.nagios.com/forum/viewtop ... 16&t=32302

It may seem a silly question, but have you tried rebooting the server instead of just restarting services?
I would stop the services first to clean up any "left over" pid files, then reboot. I experienced the same issue in this thread: http://support.nagios.com/forum/viewtop ... 16&t=32183
One thing I discovered was that using the XI interface, or even /etc/init.d, to restart services would often kill off other core processes instead of the intended one, which makes troubleshooting difficult and muddies the water. For example, last night I restarted the performance grapher from the interface and it killed off all the Nagios cron jobs. This is why I recommend shutting down the services and then rebooting.
# /etc/init.d/npcd stop
NPCD Stopped.
# /etc/init.d/ndo2db stop
Stopping ndo2db: done.
# /etc/init.d/nagios stop
Stopping nagios: .done.
# shutdown -r now

Check everything is running

Nagios Core Collector
ps -ef | grep "nagios/bin/nagios --worker" | grep -v 'grep'
There should be 5+ processes.

Nagios Cron Jobs:
ps -ef | grep "/usr/bin/php -q /usr/local/nagiosxi/cron" | grep -v 'grep'
There should be 10+ processes

Nagios Database Backend:
ps -ef | grep "ndo2db.cfg" | grep -v 'grep'
There should be 3 processes.

Nagios Itself
ps -ef | grep '/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg' | grep -v grep
which should return 2 processes.

Check your queues:
ipcs -q
The message counts should be at or very near zero.
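
As a rough way to automate the queue check above, here is a minimal Python sketch that parses the text printed by ipcs -q and lists queues with a backlog. The function names and the threshold of 100 messages are assumptions made for illustration; the sample text is the output posted earlier in this thread.

```python
def parse_ipcs_queues(text):
    """Parse the text printed by `ipcs -q` into a list of dicts."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a hex key such as 0x1f000002.
        if len(fields) != 6 or not fields[0].startswith("0x"):
            continue
        queues.append({
            "key": fields[0],
            "msqid": int(fields[1]),
            "owner": fields[2],
            "used_bytes": int(fields[4]),
            "messages": int(fields[5]),
        })
    return queues


def backlogged(queues, owner="nagios", max_messages=100):
    """Return queues owned by `owner` that hold more than `max_messages` (assumed threshold)."""
    return [q for q in queues if q["owner"] == owner and q["messages"] > max_messages]


# Sample taken from the first post in this thread.
sample = """\
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x1f000002 0          nagios     600        902465536    881314
"""

for q in backlogged(parse_ipcs_queues(sample)):
    print(f"queue {q['key']}: {q['messages']} messages, {q['used_bytes']} bytes")
```

On a live box the text could come from subprocess.run(["ipcs", "-q"], capture_output=True, text=True).stdout instead of the sample string.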

In the end I did find a check that was returning a HUGE amount of extra data. I wrapped that check inside a shell script which ran the check and then piped its output through head to limit the results. A better option would have been to modify the original check to limit its output, or to modify the process feeding the message queue so that it truncates the result before pushing it into the queue.
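
The wrapper idea can be sketched in Python as well. This is not the shell script described above, just an illustration of the same technique: run the plugin, cap the length of its output, and pass the plugin's exit code through so Nagios still sees the right state. The name run_truncated, the 8192-character limit and the 60-second timeout are assumptions.

```python
import subprocess
import sys

MAX_CHARS = 8192  # assumed limit; tune to whatever your setup tolerates


def run_truncated(command, max_chars=MAX_CHARS, timeout=60):
    """Run a plugin command, cap its output length, and return (exit_code, output)."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the captured output
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Exit code 3 is the Nagios plugin convention for UNKNOWN.
        return 3, "UNKNOWN - plugin timed out\n"
    return result.returncode, result.stdout[:max_chars]


# Demonstration with a stand-in "plugin" that prints far too much data.
code, output = run_truncated(
    [sys.executable, "-c", "print('x' * 50000)"], max_chars=100
)
print(code, len(output))
```

A blunt character cut like this can chop performance data in the middle of a value, so it is a stopgap rather than a fix for the noisy check itself.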

Very interested to know what you find.

rseiwert · Post by **rseiwert** » Wed Apr 22, 2015 5:00 pm

esendex wrote: "where should we look for bottlenecks or signs that it isn't working as optimally"

The first clue I had of a check gone awry was when a check returned the results from an entirely different check, which looked like some kind of buffer overrun issue.

The second clue came after I enabled debugging for ndo2db by editing /usr/local/nagios/etc/ndo2db.cfg.

Code: Select all

# DEBUG LEVEL
# This option determines how much (if any) debugging information will
# be written to the debug file.  OR values together to log multiple
# types of information.
# Values: -1 = Everything
#          0 = Nothing
#          1 = Process info
#          2 = SQL queries
debug_level=-1
# DEBUG VERBOSITY
# This option determines how verbose the debug log out will be.
# Values: 0 = Brief output
#         1 = More detailed
#         2 = Very detailed
debug_verbosity=2

When I looked at ndo2db.debug I noticed some larger checks, but it was hard to see what was what.
I used awk to help narrow it down:
# awk '{ print length }' /usr/local/nagios/var/ndo2db.debug | sort -n
This prints the length of every line in ndo2db.debug in ascending order, which gave me the range of SQL statement lengths. If nothing is over 4K, this is probably not your issue.
To show the offending check:
# awk '{ if (length > 16000) print length " " $0}' /usr/local/nagios/var/ndo2db.debug
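
For anyone who prefers not to use awk, the same two steps can be sketched in Python: summarise the line lengths, then pull out the lines above a threshold. The function names are made up for this example, and the demo runs on an in-memory list rather than the real debug file.

```python
def length_summary(lines):
    """Return min, median and max line length, or None if there are no lines."""
    lengths = sorted(len(line.rstrip("\n")) for line in lines)
    if not lengths:
        return None
    return {
        "min": lengths[0],
        "median": lengths[len(lengths) // 2],
        "max": lengths[-1],
    }


def long_lines(lines, threshold=16000):
    """Yield (length, line) for every line longer than `threshold` characters."""
    for line in lines:
        stripped = line.rstrip("\n")
        if len(stripped) > threshold:
            yield len(stripped), stripped


# Stand-in data; a real run would iterate over the debug file instead.
demo = ["short\n", "x" * 20000 + "\n", "medium line\n"]

print(length_summary(demo))
for length, line in long_lines(demo):
    # Print only the start of each oversized line to keep the terminal readable.
    print(length, line[:40])
```

To run it against the real file, open /usr/local/nagios/var/ndo2db.debug with errors="replace", read the lines into a list, and pass that list to both functions.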

esendex · Post by **esendex** » Wed Apr 22, 2015 5:03 pm

Thanks for taking the time to reply. I too think that ndo2db is under heavy load.
Before following your suggestions, I went to check the Core Scheduling Queue (XI_ROOT_URL/nagios/cgi-bin/extinfo.cgi?type=7), and even Core's checks had gone around 40 minutes without an update (I didn't check the kernel message queues before restarting, however). Stopping ndo2db before rebooting released 17 notifications that hadn't been pushed.

Reviewing what's running post-reboot:

Code: Select all

 ps -ef | grep "nagios/bin/nagios --worker" | grep -v 'grep'
nagios    1638  1634  0 22:43 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1639  1634  0 22:43 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1640  1634  0 22:43 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1641  1634  0 22:43 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1642  1634  0 22:43 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios    1643  1634  0 22:43 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh

Code: Select all

 ps -ef | grep "/usr/bin/php -q /usr/local/nagiosxi/cron" | grep -v 'grep'
nagios    5529  5524  0 22:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php > /usr/local/nagiosxi/var/eventman.log 2>&1
nagios    5530  5523  0 22:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php > /usr/local/nagiosxi/var/feedproc.log 2>&1
nagios    5531  5525  0 22:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php > /usr/local/nagiosxi/var/cmdsubsys.log 2>&1
nagios    5533  5530  1 22:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/feedproc.php
nagios    5534  5522  0 22:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1
nagios    5536  5526  0 22:54 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php > /usr/local/nagiosxi/var/sysstat.log 2>&1
nagios    5537  5534  2 22:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
nagios    5538  5531  2 22:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/cmdsubsys.php
nagios    5540  5536  2 22:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/sysstat.php
nagios    5542  5529  3 22:54 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/eventman.php

Code: Select all

 ps -ef | grep "ndo2db.cfg" | grep -v 'grep'
nagios    1695     1  0 22:43 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios    1706  1695  0 22:43 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios    1707  1706 91 22:43 ?        00:10:53 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg

Code: Select all

 ps -ef | grep '/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg' | grep -v grep
nagios    1634     1  1 22:43 ?        00:00:09 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    1727  1634  0 22:43 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Code: Select all

ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xca000002 0          nagios     600        137680896    134454

So I'm now off on a hunt for large checks. This problem started manifesting itself when we added a lot of check_wmi_plus checks to Nagios XI's static directory. We're attempting to add a set of new checks to the static directory using source-controlled releases. The length of time some of those checks were taking, combined with switching on the debug mode, has probably produced the scenario you describe.

With a clear kernel queue, is there anywhere else we can look to spot a backlog? Is there any MySQL table that might need cleaning up as well, to give ndo2db a clear run?

Thanks for your help so far,
Jonathan

esendex · Post by **esendex** » Wed Apr 22, 2015 6:02 pm

Quick update:

After saying in my last post that I suspected check_wmi_plus to be the culprit, I reminded myself that we'd enabled its debug mode to try to ascertain why certain WMI checks were returning Unknown statuses when running them individually (or forcing a reschedule) worked just fine.

I now suspect that we were checking too many concurrently using WMI and hitting Windows memory constraints for the WMI process. The returned NT error meant the check_wmi_plus plugin returned an Unknown status rather than the expected value.

I've now disabled the check_wmi_plus debug mode and I'm happy to announce that the kernel message queue is at 0 and has been for some time. Kernel queues aside, it would be great to have better visibility of ndo2db's throughput. I've seen a post (http://www.monitoring-portal.org/wbb/in ... eadID=9416) suggesting that enabling the MySQL log_slow_queries setting might have shown queries that were blocking the INSERTs into MySQL?
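
One crude way to get a throughput figure without touching ndo2db is to sample the message count from ipcs -q twice and compute the net drain rate. The sketch below assumes samples are (timestamp in seconds, message count) pairs. The first count is the one posted earlier in this thread; the second sample is invented purely for illustration.

```python
def drain_rate(sample_a, sample_b):
    """Net messages per second leaving the queue between two samples.

    Positive means the backlog is shrinking, negative means it is growing.
    """
    (t0, m0), (t1, m1) = sample_a, sample_b
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (m0 - m1) / elapsed


def seconds_to_empty(messages, rate):
    """Estimated seconds until the queue is empty, or None if it is not draining."""
    if rate <= 0:
        return None
    return messages / rate


first = (0, 134454)    # count from the post-reboot ipcs output above
second = (60, 128454)  # hypothetical sample taken 60 seconds later

rate = drain_rate(first, second)
print(f"net drain rate: {rate:.1f} msg/s")
print(f"estimated time to empty: {seconds_to_empty(second[1], rate):.0f} s")
```

This is a net figure: Nagios keeps adding messages while ndo2db removes them, so a rate near zero can mean either an idle queue or a consumer that is only just keeping up.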

Going to keep an eye on it overnight and update in the morning.

Thanks for your help so far, rseiwert.

rseiwert · Post by **rseiwert** » Wed Apr 22, 2015 6:06 pm

Code: Select all

ps -ef | grep "ndo2db.cfg" | grep -v 'grep'
nagios    1695     1  0 22:43 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios    1706  1695  0 22:43 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios    1707  1706 91 22:43 ?        00:10:53 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg

You see that 91 in the fourth column (C, CPU utilisation) for PID 1707? That's seriously choked; I don't think I have ever seen one that high.
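
A small Python sketch for spotting this automatically: it reads ps -ef style output and reports rows whose C column is at or above a threshold. The function name and the threshold of 50 are assumptions; the sample rows are the ones quoted above.

```python
def busy_processes(ps_output, min_cpu=50):
    """Return (pid, cpu, command) for `ps -ef` rows whose C column is >= min_cpu."""
    hits = []
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[3].isdigit():
            continue  # skips the header row and anything malformed
        cpu = int(fields[3])
        if cpu >= min_cpu:
            hits.append((int(fields[1]), cpu, fields[7]))
    return hits


ps_sample = """\
nagios    1695     1  0 22:43 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios    1706  1695  0 22:43 ?        00:00:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
nagios    1707  1706 91 22:43 ?        00:10:53 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
"""

for pid, cpu, command in busy_processes(ps_sample):
    print(f"PID {pid} at {cpu}% CPU: {command}")
```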

rseiwert · Post by **rseiwert** » Wed Apr 22, 2015 6:12 pm

Here is a quick hack to limit the output of check_wmi_plus.pl. Find the subroutine combine_display_and_perfdata and add the following right before the return. I truncated at 8196 but you can tweak that number. This is for version 1.59 which might not be the version the XI wizard is using.

Code: Select all

$combined=~s/.{8196}\K.*//s;

Code: Select all

sub combine_display_and_perfdata {
my ($display,$perfdata)=@_;
# pass in
# a nagios display string
# a nagios performance data string
my $combined='';
# now build the combined string (we are providing multiple options for programming flexibility)
# we have to make sure that we follow these rules for performance data
# if there is a \n in the $display_string, place |PERFDATA just before it
# if there is no \n, place |PERFDATA at the end
# we'll try and improve this to make it a single regex - one day .....
$debug && print "Building Combined Display/Perfdata ... ";
# look for an actual \n as a single ascii character
if ($display=~/\n/) {
   $debug && print "Found LF\n";
   $combined=$display;
   # stick the perf data just before the \n
   $combined=~s/^(.*?)\n(.*)$/$1|$perfdata\n$2/s;

# now also look for an actual \ and an n ie 2 ascii characters. This can happen, for example, when \n is defined in the display= or predisplay= settings in an ini file
} elsif ($display=~/\\n/) {
   $debug && print "Found embedded LF\n";
   $combined=$display;
   # stick the perf data just before the \n, make sure we are replacing the literal \n ie 2 ascii characters
   $combined=~s/^(.*?)\\n(.*)$/$1|$perfdata\n$2/s;
} else {
   #$debug && print "No LF\n";
   $combined="$display|$perfdata\n";
}

# if there is no perfdata | will be the last character - remove | if it is at the end
$combined=~s/\|$//;

$debug && print "IN:$display|$perfdata\n";
$debug && print "OUT:$combined\n";
# rseiwert 4-22-2015
# Hack to stop killing Nagios by limiting output
$combined=~s/.{8196}\K.*//s;
return $combined;
}

rseiwert · Post by **rseiwert** » Wed Apr 22, 2015 9:11 pm

I just missed your last post. Of course turning on debug mode would overflow Nagios.

As for your Unknown issue: have you tried increasing the timeout? I have most of mine set to 30 seconds, some to 60. If you are querying large objects or interpreted WMI data it might take some time to respond. I think I did this to avoid UNKNOWNs.

I also don't think this is a MySQL issue, but I'm not 100% sure on that yet. I really know nothing about the Nagios DB broker but might be looking into it before long. For a failing monitored system, as in my case, to generate so much noise that the NMS stops listening and continues to show green is not acceptable in anyone's book.

Edit: I'm starting to think that it is MySQL, and specifically the interaction between ndo2db and MySQL.

Post by **tgriep** » Thu Apr 23, 2015 4:42 pm

esendex, do you have any update on your server after your change to the wmi plugin?

Nagios Support Forum