Re: Engine hosed after network disconnect

Posted: Mon Oct 07, 2013 4:56 pm
by abrist
... Hey guys!

NDO is resilient enough to reconnect the data sink once the db is back. The issue here may be how long it was down. Check results will back up while they cannot be written to the historical db, and if the db is down long enough, the engine may face an uphill battle between new checks and cached old checks. It could take a considerable amount of time for everything to normalize if the outage lasted a long time (or the number of checks per 5 minutes is very high).

Re: Engine hosed after network disconnect

Posted: Tue Oct 08, 2013 8:17 am
by vAJ
Where might I find indicators of such a backlog? Any known error messages and from which log files?

Re: Engine hosed after network disconnect

Posted: Tue Oct 08, 2013 9:39 am
by slansing
I would look in the following logs:

Code:

/var/log/messages

Code:

/var/log/mysqld.log
You may also find some useful information about what happened when the outage hit your network in:

Code:

/usr/local/nagios/var/nagios.log
You can also find the nagios log archives in the above directory.
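To avoid scrolling through a very large log by hand, something like the following could pull out just the NDO-related lines. This is only a rough sketch: the helper name ndo_lines is made up for illustration, and matching on the string "ndo" (case-insensitive) is an assumption about how the relevant messages are tagged in your logs.

```shell
# Hypothetical helper: print the last N lines that mention "ndo"
# (case-insensitive) from each readable log file passed in.
ndo_lines() {
    limit=$1
    shift
    for log in "$@"; do
        # Skip logs that are missing or not readable by this user.
        [ -r "$log" ] || continue
        echo "== $log"
        grep -i 'ndo' "$log" | tail -n "$limit"
    done
}

# Example usage (paths taken from this thread):
# ndo_lines 20 /var/log/messages /usr/local/nagios/var/nagios.log
```

Reading /var/log/messages typically needs root, so you may have to run it with sudo.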

In addition, do you have logging set up on the remote mysql server? At the least a log showing when mysqld is initiated? Possibly the syslog?

I'd also check whether you have an inordinate number of temp check* files in /tmp/, or in:

Code:

/usr/local/nagios/var/spool/checkresults/
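
A quick way to put a number on the backlog is to count the files in those two places. The sketch below is only an illustration: count_matching is a made-up helper name, and it assumes a find that supports -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: count regular files directly inside a directory
# whose names match a glob. Prints 0 if the directory does not exist.
count_matching() {
    dir=$1
    pattern=$2
    if [ ! -d "$dir" ]; then
        echo 0
        return 0
    fi
    # tr strips the padding some wc implementations add.
    find "$dir" -maxdepth 1 -type f -name "$pattern" | wc -l | tr -d ' '
}

# Example usage (paths taken from this thread):
# count_matching /usr/local/nagios/var/spool/checkresults '*'
# count_matching /tmp 'check*'
```

Running it a few times a minute apart should show whether the count is growing, shrinking, or holding steady.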

Re: Engine hosed after network disconnect

Posted: Tue Oct 08, 2013 9:53 am
by vAJ
Well, we just found an issue yesterday where logging was not rotating. /var/log/messages is about 146MB.

Re: Engine hosed after network disconnect

Posted: Tue Oct 08, 2013 11:07 am
by slansing
Dear Lord, that is a lot. Can you run the following and share the output:

Code:

cat /var/lib/logrotate.status | grep messages
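
If you want only the last-rotation date for one exact path, a small awk lookup can do it. This is a sketch under assumptions: last_rotation is a made-up helper name, and it assumes the status file stores each entry as a double-quoted path followed by a date, with the status file at the location used above (it lives elsewhere on some distros).

```shell
# Hypothetical helper: print the recorded last-rotation date for one log path.
# Assumes status lines of the form: "/path/to/log" DATE
last_rotation() {
    path=$1
    file=${2:-/var/lib/logrotate.status}
    # Match the first field against the path wrapped in literal double quotes.
    awk -v key="\"$path\"" '$1 == key { print $2 }' "$file"
}

# Example usage:
# last_rotation /var/log/messages
```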

Re: Engine hosed after network disconnect

Posted: Tue Oct 08, 2013 1:30 pm
by vAJ
2013-10-6

Re: Engine hosed after network disconnect

Posted: Wed Oct 09, 2013 10:09 am
by sreinhardt
So is that 146MB after rotation and within the last two days, or prior to rotation and over some extended period of time? As you said, that's a pretty big log! Hopefully clearing that up some should let you see NDO messages just a bit more clearly. :D

Re: Engine hosed after network disconnect

Posted: Wed Oct 09, 2013 12:06 pm
by vAJ
The log still hadn't rotated this morning. Our Linux admin tells me he issued a rotate command, but it said rotation wasn't necessary, so he forced it.

The rotate schedule was set to weekly; we changed it to daily. Will know more tomorrow.

On another note, we had another stab at a big L3 network change last night and the engine did not fall over. The only difference seems to be that I rebooted the box yesterday, so it had very little buffer/cache in use. Usually cache takes up 19GB on a box with 24GB of physical memory.
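
For tracking how the cache grows back after the reboot, a rough sketch like this could log the numbers over time. The helper name meminfo_mb is made up for illustration, and it assumes a Linux-style /proc/meminfo where values are reported in kB.

```shell
# Hypothetical helper: print one field from a meminfo-style file in whole MB.
# Values in /proc/meminfo are in kB, so divide by 1024.
meminfo_mb() {
    field=$1
    file=${2:-/proc/meminfo}
    awk -v key="$field:" '$1 == key { printf "%d\n", $2 / 1024 }' "$file"
}

# Example usage:
# meminfo_mb Cached
# meminfo_mb MemTotal
```

Comparing the Cached value against MemTotal just before the next network change would show whether the box is back in the same state as when the engine fell over.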

Re: Engine hosed after network disconnect

Posted: Wed Oct 09, 2013 4:03 pm
by abrist
I haven't noticed any big upstream kernel bugs with caching recently. I guess we will not have a good test until RAM fills up with disk cache again. Let us know what happened with the log rotation today.

Re: Engine hosed after network disconnect

Posted: Thu Oct 10, 2013 12:43 pm
by vAJ
OK. Log rotation is good at 24hrs now.