Nagios process crash
Posted: Mon Jun 24, 2013 9:01 am
by uaslam
Nagios XI was working properly, showing alerts and sending notifications as expected, until it crashed at midnight on Friday.
I can connect to the web UI and it shows stale status information, no updates since Friday midnight.
The monitoring engine status shows a check mark but on mouse over it says "nagios is not running" and running service nagios status returns the same thing.
Here is what we see in /var/log/messages
Jun 22 00:00:00 nagiosand nagios: Caught SIGSEGV, shutting down...
Jun 22 00:00:00 nagiosand ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='0', lines_processed='0', entries_processed='0' WHERE conninfo_id='0''
Jun 22 00:00:00 nagiosand ndo2db: mysql_error: 'MySQL server has gone away'
Jun 22 00:00:00 nagiosand ndo2db: Error: Connection to MySQL database has been lost!
Restarting the nagios process fixed the issue, but how can we ensure this does not happen again?
Thanks.
Re: Nagios process crash
Posted: Mon Jun 24, 2013 9:19 am
by slansing
Not sure; it could have crashed for any number of reasons. You may have run out of memory, inodes, etc., and the shortage may have cleared up when something changed. You could also have had a massive load increase at that time. Were you running anything unrelated to Nagios then, such as updates or backups? Did a large number of services go down, causing Nagios to check them rapidly? Unless we can draw connections like that, there is unfortunately not much that can be done.
Re: Nagios process crash
Posted: Mon Jun 24, 2013 9:41 am
by uaslam
We have two nagios servers (the one crashing is not in production) and the production server does not show any such anomalies.
No heavy load/memory issues.
From what I've read, setting check_for_updates=0 in the nagios.cfg file is suggested as a workaround, since Nagios will then not check for updates.
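For anyone following along, here is a minimal sketch of checking the current value of that directive before editing it. The default path /usr/local/nagios/etc/nagios.cfg is an assumption about where the main config file lives, and the helper name show_update_check is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the active check_for_updates line from a Nagios main config file.
# The default path below is an assumption; pass the real path as the first argument.
show_update_check() {
    cfg="${1:-/usr/local/nagios/etc/nagios.cfg}"
    # Match the directive at the start of a line so commented-out copies are ignored.
    grep -E '^[[:space:]]*check_for_updates[[:space:]]*=' "$cfg"
}
```

Calling show_update_check with no argument reads the assumed default path. After changing the value to 0, the nagios service needs a restart for the new setting to be read.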
A couple of things to note: when we were setting up a clean install of XI on this server, it was working fine, but the performance grapher component started to show up as down.
Sometime last week, we updated Nagios XI to 2012R2.2 from the previous release, which also ended up fixing the grapher component. The upgrade process applied a few patches, and I do remember seeing a couple of DB errors, something about adding something that already existed or removing something that was already removed (don't have any upgrade logs, sorry).
Other forums suggest it's the NDO DB that's out of sync. I'm not sure, but I have a feeling the underlying problem is the ndo2db back end. Are there any sanity checks I could run, and/or some way to reinstall/restore all of it to Nagios XI defaults? Also, any DB sanity checks/repairs?
This server is not in production, we have complete nagiosxi configuration backups so really anything is an option.
Thanks,
Usman
Re: Nagios process crash
Posted: Mon Jun 24, 2013 9:52 am
by slansing
Unfortunately, it could be any number of these issues, or none of them at all. Unless you can actively reproduce the crash, or you have logging that can help determine a cause at that specific point in time, it is very difficult to find a way to prevent it in the future. By the same token, it may have just happened once and will never happen again.
As far as database repairs go, you can run the following script to repair the nagios and nagiosql databases:
/usr/local/nagiosxi/scripts/repairmysql.sh nagios
/usr/local/nagiosxi/scripts/repairmysql.sh nagiosql
I would also verify that all of the green check marks are present on the top right hand spot in your XI web UI.
You may want to review the server's hardware status at:
Admin > System Status
Let us know what the output is on this page.
Do you have a full syslog of the hour leading up to the crash and the minutes afterwards? Same goes for the nagios log?
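A minimal sketch of pulling that window out of syslog, assuming the traditional "Mon DD HH:MM:SS" timestamp format seen in the /var/log/messages excerpt earlier in this thread. The function name log_window and its argument order are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print syslog lines for one day between two HH:MM:SS times (inclusive).
# Assumes traditional syslog timestamps such as "Jun 22 00:00:00 host ...".
# Usage: log_window FILE MONTH DAY START END
log_window() {
    awk -v mon="$2" -v day="$3" -v start="$4" -v end="$5" '
        # Field 1 is the month, field 2 the day, field 3 the time.
        # HH:MM:SS strings compare correctly as plain text.
        $1 == mon && $2 == day && $3 >= start && $3 <= end
    ' "$1"
}
```

For the crash in this thread, something like `log_window /var/log/messages Jun 21 23:00:00 23:59:59` followed by `log_window /var/log/messages Jun 22 00:00:00 00:10:00` would cover the hour before and the ten minutes after.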
Re: Nagios process crash
Posted: Mon Jun 24, 2013 10:03 am
by uaslam
Thanks for the DB scripts. We've run those and enabled the check_for_updates directive. We do have full logs leading up to the crash, and I did review them, but nothing stands out.
If it crashes again, I will attach full logs, but this can be marked resolved for now. Thanks again!
Just out of curiosity, how frequently does Nagios check for updates when the check_for_updates directive is enabled in nagios.cfg?
Re: Nagios process crash
Posted: Mon Jun 24, 2013 10:53 am
by lmiltchev
This directive just enables/disables the check for updates. It's run on a cron job once a day.