Checks stop running randomly

Post by **gwakem** » Mon Jun 18, 2012 3:35 pm

It is offloaded. I will give that a go and let you know what happens. On a side note, I enabled debugging on the ndo2db process, and I am seeing this crop up in the logs, so I think your theory above may be correct:

Code:

[1340051530.413987] [002.0] [pid=769] UPDATE nagios_conninfo SET last_checkin_time=NOW(), bytes_processed='314', lines_processed='22', entries_processed='0' WHERE conninfo_id='1952'
[1340051530.414455] [002.0] [pid=769] INSERT INTO nagios_logentries SET instance_id='1', logentry_time=FROM_UNIXTIME(1340051514), entry_time=FROM_UNIXTIME(1340051514), entry_time_usec='23591', logentry_type='262144', logentry_data='ndomod: Error writing to data sink!  Some output may get lost\.\.\.', realtime_data='1', inferred_data_extracted='1'
[1340051530.415080] [002.0] [pid=769] DELETE FROM nagios_timedevents WHERE instance_id='1' AND scheduled_time<FROM_UNIXTIME(1339965130)
[1340051530.415374] [002.0] [pid=769] DELETE FROM nagios_systemcommands WHERE instance_id='1' AND start_time<FROM_UNIXTIME(1339446730)
[1340051530.415683] [002.0] [pid=769] DELETE FROM nagios_servicechecks WHERE instance_id='1' AND start_time<FROM_UNIXTIME(1339446730)
[1340051530.416008] [002.0] [pid=769] DELETE FROM nagios_hostchecks WHERE instance_id='1' AND start_time<FROM_UNIXTIME(1339446730)
[1340051530.421885] [002.0] [pid=769] DELETE FROM nagios_eventhandlers WHERE instance_id='1' AND start_time<FROM_UNIXTIME(1337373130)
[1340051530.422345] [002.0] [pid=769] INSERT INTO nagios_logentries SET instance_id='1', logentry_time=FROM_UNIXTIME(1340051514), entry_time=FROM_UNIXTIME(1340051514), entry_time_usec='23723', logentry_type='262144', logentry_data='ndomod: Please check remote ndo2db log, database connection or SSL Parameters', realtime_data='1', inferred_data_extracted='1'
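
To pull just the ndomod complaints out of a debug log like the one above, something along these lines might help. It is only a sketch: it assumes each entry sits on a single line in the format shown, and the function name `ndomod_messages` is made up for illustration.

```python
import re

# Matches the leading "[epoch.usec]" stamp and the logentry_data value of an
# INSERT INTO nagios_logentries line (layout assumed from the excerpt above).
LOGENTRY_RE = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\].*logentry_data='(?P<data>.*?)', realtime_data="
)


def ndomod_messages(lines):
    """Return (timestamp, message) pairs for log entries written by ndomod."""
    results = []
    for line in lines:
        match = LOGENTRY_RE.search(line)
        if match and match.group("data").startswith("ndomod:"):
            # The debug log escapes some characters with a backslash; drop it.
            message = re.sub(r"\\(.)", r"\1", match.group("data"))
            results.append((float(match.group("ts")), message))
    return results
```

Feeding it the lines of ndo2db.debug would list only the "Error writing to data sink" style messages with their timestamps, which makes it easier to line them up against the times the checks froze.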

Post by **gwakem** » Mon Jun 18, 2012 4:19 pm

No luck. Checks are still getting stuck after the my.cnf update. We track DB connections, and since we use MySQL 5.0 the default limit should have been 100 connections. We hit around 70 only once; all the other readings were in the 40-50 range while checks were freezing. Some of the checks I found frozen had stopped about 5-7 minutes after I ran an apply, which usually kicks all of the frozen ones back into gear.

When I restart ndo2db, I see this in the messages log:

Code:

Jun 18 15:14:57 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters

so I re-enabled debugging on the ndo2db process (since I don't see a regular logging option, and a log level of 1 for process only and 1 for verbosity wasn't generating anything but a 0-byte file), and I'm checking the specifics of the MySQL queries that are being sent, but I'm not seeing anything so far. Any other ideas?

Post by **gwakem** » Mon Jun 18, 2012 4:37 pm

In tailing the ndo2db.debug log for two frozen checks, I found that neither showed up in the debug output until I forced a recheck.

So far, I'm leaning against DB corruption, it should not be a DB timeout or max connection issue, and ndo2db doesn't seem to be running into issues outside of the messages log error (which doesn't make sense to me, as it's pretty vague).

Could this be a bug in the core nagios engine's logic?

I'm going to try disabling all passive checks globally in the nagios.cfg to make sure something isn't happening that makes the checks (even though they are active) suddenly think they're passive only.

Post by **gwakem** » Mon Jun 18, 2012 5:18 pm

In nagios.cfg:

Code:

accept_passive_host_checks=0
accept_passive_service_checks=0

has been set and confirmed. The services still seemed to have passive checks enabled (even though they are active), so I disabled passive checks in the templates that we assign to the services to set retries, contacts, timeframes, etc. We also checked the service configs directly, and passive was set to skip. There were no other locations that referenced passive checks being enabled, yet for some reason they still show as enabled after an apply.

Our theory is that when a check has both active and passive checks enabled, something happens to the check and it suddenly goes passive, since the behavior we see looks like that of a passive check. The fact that we can't seem to disable passive checks for services short of doing it in the service config itself via the web interface makes me wonder if this is the case.

We are going to try and script something that mass disables passive checks so we don't have to go through 4200 services. In the meantime, have you seen this behavior in your labs or other users, or is this isolated to us?

Post by **gwakem** » Mon Jun 18, 2012 6:40 pm

Disabling passive checks on services and applying still results in frozen checks. We were able to query the status.dat file to find a list of services that have matching last and next check times. We are going to use that to script forcing a manual recheck to make sure we "keep the lights on" and don't freeze on a critical check during the night.
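
For anyone wanting to try the same status.dat query, a rough sketch follows. It is not the script used in this thread; it assumes the usual Nagios Core layout of `servicestatus { ... }` blocks containing `key=value` lines, and the names `parse_status_blocks` and `find_frozen_services` are illustrative only.

```python
def parse_status_blocks(text, block_type="servicestatus"):
    """Yield one dict per `block_type { ... }` block found in status.dat text."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            # Only start collecting when the wanted block type opens.
            if line == block_type + " {":
                current = {}
        elif line == "}":
            yield current
            current = None
        elif "=" in line:
            # Split on the first "=" only; plugin output may contain more.
            key, _, value = line.partition("=")
            current[key] = value


def find_frozen_services(text):
    """Return (host, service) pairs whose last_check equals next_check."""
    frozen = []
    for block in parse_status_blocks(text):
        last_check = block.get("last_check")
        next_check = block.get("next_check")
        # Assumption: a value of 0 means "not set", so 0 == 0 is not a freeze.
        if last_check and last_check != "0" and last_check == next_check:
            frozen.append((block.get("host_name"), block.get("service_description")))
    return frozen
```

Reading the file (on a default install it normally sits next to retention.dat, but treat the path as an assumption) and passing its contents to `find_frozen_services` gives the list of services to recheck.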

On a side note, we noticed that the number of frozen checks was 5 at the first count, 235 at the next, and 243 at the third (all within a span of five minutes).

Post by **gwakem** » Tue Jun 19, 2012 9:16 am

On coming in this morning, we discovered that running the script that forces a recheck on frozen services helped, but we still lost graphing for various checks overnight. In a ten-minute span we see the number of frozen checks reach around 450, then the cron script kicks off.
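
A forced recheck of this kind is typically done through the Nagios external command file. The sketch below shows one way it might be scripted; it is not the script used in this thread. The function names are hypothetical, and the command file path is the common source-install default, which is an assumption and may differ on your system.

```python
import time


def forced_check_command(host, service, now=None):
    """Format a SCHEDULE_FORCED_SVC_CHECK external command line."""
    now = int(time.time()) if now is None else int(now)
    # Format: [timestamp] COMMAND;host;service;check_time
    return "[{0}] SCHEDULE_FORCED_SVC_CHECK;{1};{2};{0}\n".format(now, host, service)


def submit_forced_checks(services, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write one forced-check command per (host, service) pair to the command pipe.

    The default path is an assumption; check command_file in nagios.cfg.
    """
    lines = [forced_check_command(host, service) for host, service in services]
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)
    return len(lines)
```

Run from cron every few minutes with the list of frozen (host, service) pairs, this keeps checks moving as a stopgap, but it does not address whatever is causing them to stall.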

Post by **mguthrie** » Tue Jun 19, 2012 9:28 am

I don't suppose you'd be willing to send that parsing script our way so we can try to recreate this locally?

Typically if there's DB corruption, there will be a fairly substantial increase in CPU usage and also in load times from the web interface. Let's try the brute-force repair on the databases:

Code:

service mysqld stop
myisamchk -r -f /var/lib/mysql/nagios/*.MYI
service mysqld start
service ndo2db restart

Just to rule out any oddness with the retention file, let's stop nagios, delete the retention file, and start it up again. Note that this will put all of your checks in a pending state until the new results come in, and it will clear any runtime changes like "disable notifications" for services, etc.

Code:

service nagios stop
rm -f /usr/local/nagios/var/retention.dat
service nagios start

I've got a remote session later today with someone else experiencing oddness with ndoutils, and I'm going to see if that sheds any light on this issue as well. If not, we can look at doing a remote session sometime this week and seeing if we can figure out the cause of this.

Post by **gwakem** » Tue Jun 19, 2012 9:59 am

We are definitely not seeing any kind of high CPU usage on either the master or the DB server, or latency on the web interface, but we will give that a try.

We have had to stop our monitoring migration until this is resolved. We have roughly 1100 hosts and 4200 services on critical boxes currently migrated to NagiosXI that are counting on us to alert them in the event of a problem, and since the upgrade from r2.3 to r3.1, we have been unable to guarantee checks will reliably run at a defined interval, or provide reliable graph data. Not to put too fine a point on it, but this is a pretty critical issue for us, and we need to find at least a reliable temporary solution that keeps us propped up. I know you guys are busy (believe me, we're also familiar with busy!) but the sooner we can get a remote session, the better.

I have attached the PHP script we use to check the number of frozen checks. We use a modified version of this script to kick off a recheck.

Edit: Since it won't allow me to upload .php, I have to email it over.

Post by **gwakem** » Tue Jun 19, 2012 11:16 am

We may have found the issue, detailed post incoming.

Post by **mguthrie** » Tue Jun 19, 2012 11:20 am

Understandable. If you're able to maintain with the current version we can schedule a remote session for tomorrow if you'd like. I'm available from 9:30am-3pm CST (UTC -0600). Otherwise if you need to downgrade Core and NdoUtils for the time being, the instructions are below:

Code:

cd /tmp
wget http://assets.nagios.com/downloads/nagiosxi/2011/xi-2011r2.4.tar.gz
tar zxf xi-2011r2.4.tar.gz
cd nagiosxi/subcomponents/nagioscore
./upgrade
cd /tmp/nagiosxi/subcomponents/ndoutils/
./upgrade

This will recompile both the Core and Ndoutils binaries. However, unless the issue somehow persists we'll have to skip the remote session for the time being.

Do you guys have a test monitoring environment in place? If not, I might recommend it for piloting the new version. Your license covers a production install, a test install, and a backup install.
