
Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 10:28 am
by gwakem
The attached pictures were taken at 09:25 MST to give you an idea of what we're seeing. We disabled passive checks to rule out any possibility of that causing issues, but didn't force a recheck, as there was no guarantee it would freeze again.

Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 10:33 am
by mguthrie
I'd like to rule out DB corruption as a possibility. Let's go ahead and run the following procedure to make sure that's not causing an issue.

http://assets.nagios.com/downloads/nagi ... tabase.pdf


Also, can you run a live tail on your system log and nagios log and see if any ndo2db-related errors show up?

Code: Select all

tail -f /var/log/messages
tail -f /usr/local/nagios/var/nagios.log
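
If running two separate tails is awkward, the two logs can be followed together and filtered down to the ndo lines. This is only a sketch: `filter_ndo` and `watch_ndo_logs` are hypothetical helper names, the paths are the ones quoted above, and it assumes GNU `tail` and `grep` as shipped on a typical Linux install.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that mention ndo2db or ndomod.
filter_ndo() {
    # --line-buffered makes grep print each match immediately when fed from a pipe
    grep --line-buffered -i -E 'ndo2db|ndomod'
}

# Hypothetical helper: follow both logs at once, showing only new ndo-related lines.
# $1 and $2 override the default log paths.
watch_ndo_logs() {
    # -F keeps following across log rotation; -n 0 skips lines already in the files
    tail -F -n 0 "${1:-/var/log/messages}" "${2:-/usr/local/nagios/var/nagios.log}" | filter_ndo
}
```

Running `watch_ndo_logs` in one terminal while restarting ndo2db and nagios in another should show any new errors as they arrive (Ctrl-C to stop).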

Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 11:03 am
by gwakem
We stopped nagios and ndo2db, and ran the repair:

Code: Select all

[root@sidhqmonadm0 ~]# service mysqld stop
Stopping MySQL:                                            [  OK  ]
[root@sidhqmonadm0 ~]# ps ax |grep mysqld
25900 pts/1    S+     0:00 grep mysqld
[root@sidhqmonadm0 ~]# ./repairmysql.sh nagios *
DATABASE: nagios
TABLE:    
/var/lib/mysql/nagios ~
Stopping MySQL:                                            [FAILED]
- recovering (with sort) MyISAM-table 'nagios_acknowledgements.MYI'
Data records: 2458
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_commands.MYI'
Data records: 156
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_commenthistory.MYI'
Data records: 28805
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_comments.MYI'
Data records: 2061
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_configfiles.MYI'
Data records: 1
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_configfilevariables.MYI'
Data records: 138
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_conninfo.MYI'
Data records: 1940
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_contact_addresses.MYI'
Data records: 0
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactgroup_members.MYI'
Data records: 108
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactgroups.MYI'
Data records: 36
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contact_notificationcommands.MYI'
Data records: 1056
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactnotificationmethods.MYI'
Data records: 75763
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactnotifications.MYI'
Data records: 75763
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
          
---------

- recovering (with sort) MyISAM-table 'nagios_contacts.MYI'
Data records: 88
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_contactstatus.MYI'
Data records: 88
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_customvariables.MYI'
Data records: 2151
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_customvariablestatus.MYI'
Data records: 2151
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with keycache) MyISAM-table 'nagios_dbversion.MYI'
Data records: 0
          
---------

- recovering (with sort) MyISAM-table 'nagios_downtimehistory.MYI'
Data records: 950
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_eventhandlers.MYI'
Data records: 11
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_externalcommands.MYI'
Data records: 1832
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_flappinghistory.MYI'
Data records: 6137
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostchecks.MYI'
Data records: 980
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_host_contactgroups.MYI'
Data records: 1986
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_host_contacts.MYI'
Data records: 142
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostdependencies.MYI'
Data records: 1424
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostescalation_contactgroups.MYI'
Data records: 1024
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostescalation_contacts.MYI'
Data records: 1002
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostescalations.MYI'
Data records: 466
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostgroup_members.MYI'
Data records: 2568
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hostgroups.MYI'
Data records: 253
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_host_parenthosts.MYI'
Data records: 1470
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_hosts.MYI'
Data records: 1113
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_hoststatus.MYI'
Data records: 1113
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
- Fixing index 7
- Fixing index 8
- Fixing index 9
- Fixing index 10
- Fixing index 11
- Fixing index 12
- Fixing index 13
- Fixing index 14
- Fixing index 15
- Fixing index 16
- Fixing index 17
- Fixing index 18
- Fixing index 19
          
---------

- recovering (with sort) MyISAM-table 'nagios_instances.MYI'
Data records: 1
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_logentries.MYI'
Data records: 911988
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
          
---------

- recovering (with sort) MyISAM-table 'nagios_notifications.MYI'
Data records: 120266
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
          
---------

- recovering (with sort) MyISAM-table 'nagios_objects.MYI'
Data records: 11624
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
          
---------

- recovering (with sort) MyISAM-table 'nagios_processevents.MYI'
Data records: 11406
- Fixing index 1
          
---------

- recovering (with sort) MyISAM-table 'nagios_programstatus.MYI'
Data records: 1
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_runtimevariables.MYI'
Data records: 18
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_scheduleddowntime.MYI'
Data records: 338
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicechecks.MYI'
Data records: 3034
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
          
---------

- recovering (with sort) MyISAM-table 'nagios_service_contactgroups.MYI'
Data records: 7009
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_service_contacts.MYI'
Data records: 1407
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicedependencies.MYI'
Data records: 620
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_serviceescalation_contactgroups.MYI'
Data records: 4508
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_serviceescalation_contacts.MYI'
Data records: 3995
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_serviceescalations.MYI'
Data records: 1756
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicegroup_members.MYI'
Data records: 317
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicegroups.MYI'
Data records: 51
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_services.MYI'
Data records: 4179
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_servicestatus.MYI'
Data records: 4179
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
- Fixing index 7
- Fixing index 8
- Fixing index 9
- Fixing index 10
- Fixing index 11
- Fixing index 12
- Fixing index 13
- Fixing index 14
- Fixing index 15
- Fixing index 16
- Fixing index 17
- Fixing index 18
- Fixing index 19
          
---------

- recovering (with sort) MyISAM-table 'nagios_statehistory.MYI'
Data records: 450349
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_systemcommands.MYI'
Data records: 98
- Fixing index 1
- Fixing index 2
- Fixing index 3
          
---------

- recovering (with sort) MyISAM-table 'nagios_timedeventqueue.MYI'
Data records: 4610
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
          
---------

- recovering (with sort) MyISAM-table 'nagios_timedevents.MYI'
Data records: 0
- Fixing index 1
- Fixing index 2
- Fixing index 3
- Fixing index 4
- Fixing index 5
- Fixing index 6
          
---------

- recovering (with sort) MyISAM-table 'nagios_timeperiods.MYI'
Data records: 18
- Fixing index 1
- Fixing index 2
          
---------

- recovering (with sort) MyISAM-table 'nagios_timeperiod_timeranges.MYI'
Data records: 108
- Fixing index 1
- Fixing index 2
Starting MySQL:                                            [  OK  ]
~
 
===============
REPAIR COMPLETE
===============
[root@sidhqmonadm0 ~]# service mysqld start
Starting MySQL:                                            [  OK  ]
When we restarted nagios and ndo2db, we found some oddness; host checks were not running, and service checks were not found underneath the hosts. We confirmed that the services still existed and were enabled. It took an apply (with no changes, just an apply) to bring everything back into place and get it running again. Attaching screenshots of strangeness.

Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 11:06 am
by KevinD
As for the log tails, here is the output. It looks like something happened during the initial startup and was cleared by the apply.

Start after DB repair

Code: Select all

[2012/06/18 09:44:49] Caught SIGTERM, shutting down...
[2012/06/18 09:44:49] Successfully shutdown... (PID=24442)
[2012/06/18 09:44:58] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' deinitialized successfully.
[2012/06/18 09:44:58] ndomod: Shutdown complete.
[2012/06/18 09:44:58] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
[2012/06/18 09:47:11] Nagios 3.4.1 starting... (PID=8676)
[2012/06/18 09:47:11] Local time is Mon Jun 18 09:47:11 MDT 2012
[2012/06/18 09:47:11] LOG VERSION: 2.0
[2012/06/18 09:47:11] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' initialized successfully.
[2012/06/18 09:47:11] ndomod: NDOMOD 1.5.1 (05-15-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[2012/06/18 09:47:11] ndomod: Could not open data sink!  I'll keep trying, but some output may get lost...
[2012/06/18 09:47:11] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[2012/06/18 09:47:27] ndomod: Successfully connected to data sink.  25715 items lost, 5000 queued items to flush.
[2012/06/18 09:47:27] ndomod: Successfully flushed 5000 queued items to data sink.
After Apply:

Code: Select all

[2012/06/18 09:56:46] Caught SIGTERM, shutting down...
[2012/06/18 09:56:46] Successfully shutdown... (PID=8690)
[2012/06/18 09:56:47] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' deinitialized successfully.
[2012/06/18 09:56:47] ndomod: Shutdown complete.
[2012/06/18 09:56:47] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
[2012/06/18 09:56:49] Nagios 3.4.1 starting... (PID=17907)
[2012/06/18 09:56:49] Local time is Mon Jun 18 09:56:49 MDT 2012
[2012/06/18 09:56:49] LOG VERSION: 2.0
[2012/06/18 09:56:49] Event broker module '/usr/local/nagios/lib/dnxPlugin.so' initialized successfully.
[2012/06/18 09:56:49] ndomod: NDOMOD 1.5.1 (05-15-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[2012/06/18 09:56:49] ndomod: Successfully connected to data sink.  0 queued items to flush.
[2012/06/18 09:56:49] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
However, we started nagios before starting NDO, so this is probably just because NDO was not running yet (we see this in the log whenever we start up in that order).

Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 11:16 am
by gwakem
Please note that we ran the repair on the nagios database only, and not the nagiosql DB. If you believe we should try that also, we can.

Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 1:18 pm
by mguthrie
Yeah, we have been seeing some temporary oddness with ndoutils after the upgrade to 3.x on larger systems. Your culprit is probably right here:

Code: Select all

[2012/06/18 09:47:27] ndomod: Successfully connected to data sink.  25715 items lost, 5000 queued items to flush.
[2012/06/18 09:47:27] ndomod: Successfully flushed 5000 queued items to data sink.
The latest version of ndoutils uses asynchronous writes, so there may have been some oddness on the system at a pretty low level for a little bit. Are you still seeing the inconsistency in the check times?
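
To see how often this happens and how much data is dropped each time, the reconnect lines can be totalled from nagios.log. A rough sketch, assuming the log wording matches the lines quoted above; `lost_items` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: sum the "N items lost" counts from ndomod reconnect lines.
# Reads the files given as arguments, or stdin when none are given.
lost_items() {
    awk '/ndomod: Successfully connected to data sink/ {
             # find the number that precedes the words "items lost,"
             for (i = 1; i < NF; i++)
                 if ($(i + 1) == "items" && $(i + 2) == "lost,") { total += $i; n++ }
         }
         END { printf "%d reconnects with loss, %d items lost\n", n, total }' "$@"
}
```

For example, `lost_items /usr/local/nagios/var/nagios.log` would summarize the current log. Reconnect lines that only report queued items (no "items lost") are not counted.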

Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 1:28 pm
by gwakem
We are. Even after running an apply, checks still randomly stop. They run for a while and then exhibit the "freeze" issue, where last check and next check show the same time and that time is in the past.
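
One way to catch the freeze without watching the UI is to scan the Nagios status file for active checks whose next check time is already in the past. This is a sketch under assumptions: it assumes the status file is at /usr/local/nagios/var/status.dat and uses the usual `hoststatus { ... }` / `servicestatus { ... }` blocks with `last_check`, `next_check` and `active_checks_enabled` keys. `stale_checks` is a hypothetical helper name and the 300 second grace period is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: list hosts/services whose next_check is overdue.
# Arguments (all optional): status file, grace period in seconds, "now" as epoch seconds.
stale_checks() {
    file="${1:-/usr/local/nagios/var/status.dat}"
    grace="${2:-300}"
    now="${3:-$(date +%s)}"
    awk -v now="$now" -v grace="$grace" '
        # start of a host or service status block: reset per-object state
        /^(host|service)status [{]/ { inblk = 1; host = ""; svc = ""; last = 0; next_t = 0; active = 1; next }
        # end of block: report if the scheduled check time has passed by more than the grace period
        inblk && /^[ \t]*[}]/ {
            if (active && next_t > 0 && next_t + grace < now)
                printf "%s;%s last_check=%d next_check=%d\n", host, svc, last, next_t
            inblk = 0; next
        }
        inblk {
            line = $0; sub(/^[ \t]+/, "", line)
            eq = index(line, "="); if (!eq) next
            key = substr(line, 1, eq - 1); val = substr(line, eq + 1)
            if (key == "host_name") host = val
            else if (key == "service_description") svc = val
            else if (key == "last_check") last = val + 0
            else if (key == "next_check") next_t = val + 0
            else if (key == "active_checks_enabled") active = val + 0
        }' "$file"
}
```

Running `stale_checks` from cron every few minutes and alerting when it prints anything would give a timestamp for when the freeze starts, which can then be matched against the ndo2db entries in the system log. Host entries print with an empty service field after the semicolon.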

Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 1:36 pm
by mguthrie
Can you send us the grep of your system log?

Code: Select all

grep ndo2db /var/log/messages
If ndoutils is dropping any data it would show up there. Can you also post what values you have in /etc/sysctl.conf for the following?

Code: Select all

# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 131072000

# Controls the maximum size of a single message, in bytes
kernel.msgmax = 131072000
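
Values in /etc/sysctl.conf only take effect after `sysctl -p` or a reboot, so it can help to compare the file against what the running kernel reports. A minimal sketch; `conf_value` and `compare_msg_limits` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the last value assigned to a key in a sysctl-style file.
# $1 = key name, $2 = file (defaults to /etc/sysctl.conf)
conf_value() {
    awk -F= -v key="$1" '
        /^[ \t]*[#;]/ { next }                       # skip comment lines
        { k = $1; gsub(/[ \t]/, "", k) }             # key with whitespace removed
        k == key { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); found = v }
        END { print found }' "${2:-/etc/sysctl.conf}"
}

# Hypothetical helper: show file value next to the live kernel value for both limits.
compare_msg_limits() {
    for key in kernel.msgmnb kernel.msgmax; do
        echo "$key file=$(conf_value "$key") running=$(sysctl -n "$key")"
    done
}
```

If the two columns differ, the file has not been applied yet. `ipcs -q` additionally lists the existing message queues and how many bytes each currently holds.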

Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 2:36 pm
by gwakem
Output from /var/log/messages for the last few days:

Code: Select all

Jun 12 17:46:56 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'DELETE FROM nagios_timedeventqueue WHERE instance_id='1' AND event_type='0' AND scheduled_time=FROM_UNIXTIME(1339549410) AND recurring_event='0' AND object_id='7625'' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost! 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'DELETE FROM nagios_timedeventqueue WHERE instance_id='1' AND scheduled_time<FROM_UNIXTIME(1339549410)' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost! 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='59756811', lines_processed='5893057', entries_processed='185956' WHERE conninfo_id='1877'' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away' 
Jun 12 19:03:30 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost! 
Jun 12 19:03:30 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Jun 15 14:33:25 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 14:43:38 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 14:46:32 sidhqmonm0 nagios: ndomod: Please check remote ndo2db log, database connection or SSL Parameters
Jun 15 14:48:18 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 14:52:08 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 14:58:28 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 15:05:16 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 15:10:30 sidhqmonm0 ndo2db: Error: queue send error, retrying... 
Jun 15 15:14:45 sidhqmonm0 ndo2db: Error: queue send error, retrying...
Jun 18 07:47:19 sidhqmonm0 ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='0', lines_processed='0', entries_processed='0' WHERE conninfo_id='0'' 
Jun 18 07:47:19 sidhqmonm0 ndo2db: mysql_error: 'MySQL server has gone away' 
Jun 18 07:47:19 sidhqmonm0 ndo2db: Error: Connection to MySQL database has been lost!
The last entry, at 07:47, is where someone came in, saw that graphs had stopped, and did an apply.
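
For a longer log, the two error types seen above can be counted per day to show when each one clusters. A rough sketch that assumes the standard syslog line layout (month and day as the first two fields) and the exact message wording quoted above; `ndo_error_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count ndo2db "gone away" and "queue send" errors per day.
# $1 = log file (defaults to /var/log/messages)
ndo_error_summary() {
    awk '
        /ndo2db: .*MySQL server has gone away/ { c[$1 " " $2 ";gone_away"]++ }
        /ndo2db: Error: queue send error/      { c[$1 " " $2 ";queue_send"]++ }
        END {
            for (k in c) { split(k, p, ";"); printf "%s %s %d\n", p[1], p[2], c[k] }
        }' "${1:-/var/log/messages}" | sort
}
```

Each output line is the date, the error type, and the number of occurrences. The sort is plain text order, so it groups days within a month but does not order months chronologically.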

/etc/sysctl.conf

Code: Select all

# Controls the default maximum size of a message queue, in bytes
kernel.msgmnb = 131072000

# Controls the maximum size of a single message, in bytes
kernel.msgmax = 131072000

Re: Checks stop running randomly

Posted: Mon Jun 18, 2012 3:28 pm
by mguthrie
Try adding the following lines to /etc/my.cnf, underneath the [mysqld] section header:

Code: Select all

[mysqld]
max_connections=200
connect_timeout=30
Then restart mysqld:

Code: Select all

service mysqld restart
It appears as though either you're hitting the max connections for mysql or the connections are timing out. Is your mysql on the same server as XI, or is it offloaded to a second machine? (If it's offloaded, make the above changes on the remote machine.)
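
A setting placed under the wrong section of my.cnf is silently ignored by the server, so it is worth confirming both the file and the live values after the restart. A minimal sketch: `mycnf_value` and `show_connection_usage` are hypothetical helper names, the parser handles only simple `key=value` lines, and the `mysql` call assumes the client can log in without prompting (for example via credentials in ~/.my.cnf).

```shell
#!/bin/sh
# Hypothetical helper: print a key's value from one section of an ini-style file.
# $1 = section name, $2 = key, $3 = file (defaults to /etc/my.cnf)
mycnf_value() {
    awk -v want="[$1]" -v key="$2" '
        /^[ \t]*[#;]/ { next }                                  # skip comment lines
        /^[ \t]*\[/   { sect = $0; gsub(/[ \t]/, "", sect); next }  # track current section
        sect == want {
            n = index($0, "="); if (!n) next
            k = substr($0, 1, n - 1); v = substr($0, n + 1)
            gsub(/[ \t]/, "", k); gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) found = v
        }
        END { print found }' "${3:-/etc/my.cnf}"
}

# Hypothetical helper: ask the running server for its limit and its high-water mark.
show_connection_usage() {
    mysql -e "SHOW VARIABLES LIKE 'max_connections';
              SHOW VARIABLES LIKE 'connect_timeout';
              SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
}
```

`mycnf_value mysqld max_connections` should print 200 once the file is edited. If `Max_used_connections` from `show_connection_usage` is close to `max_connections`, the connection limit is the more likely cause; if it stays well below, the timeouts are the better suspect.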