Nagios not starting - SIGSEGV error
Posted: Mon Apr 11, 2016 9:02 pm
This morning I woke up to find the Nagios process had stopped. Looking at the messages log, it appears to start normally, then starts processing all the downtimes, and then falls over again.
Looking at the Scheduled Downtime screen, all of these were scheduled for downtime starting at 04:30 (when it all stopped working) and ending at 05:30. They are recurring downtimes.
Code:
Apr 12 09:45:09 nagios: SERVICE DOWNTIME ALERT: host1.com.au;ZPOOL Status;STARTED; Service has entered a period of scheduled downtime
Apr 12 09:45:09 nagios: SERVICE DOWNTIME ALERT: host2;CPU Usage;STARTED; Service has entered a period of scheduled downtime
Apr 12 09:45:09 nagios: SERVICE DOWNTIME ALERT: host2;Swap Usage;STARTED; Service has entered a period of scheduled downtime
Apr 12 09:45:09 nagios: HOST DOWNTIME ALERT: host3;STARTED; Host has entered a period of scheduled downtime
Apr 12 09:45:09 nagios: SERVICE DOWNTIME ALERT: host4;/ Disk Free - Linux;STOPPED; Service has exited from a period of scheduled downtime
Apr 12 09:45:09 nagios: HOST DOWNTIME ALERT: host6;STOPPED; Host has exited from a period of scheduled downtime
Apr 12 09:45:09 nagios: Caught SIGSEGV, shutting down...
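To narrow down which downtime entry was handled last before the crash, the alert lines can be pulled out of the messages log in order. The following is only a minimal Python sketch, assuming the log lines keep the `HOST DOWNTIME ALERT` / `SERVICE DOWNTIME ALERT` layout shown above. The function name `parse_downtime_alerts` is hypothetical and not part of Nagios.

```python
import re

# Matches syslog-style lines such as:
#   "... nagios: SERVICE DOWNTIME ALERT: host2;CPU Usage;STARTED; ..."
ALERT_RE = re.compile(r"nagios: (HOST|SERVICE) DOWNTIME ALERT: (.*)$")


def parse_downtime_alerts(lines):
    """Return (kind, host, service, state) tuples in log order.

    Parsing stops at the first "Caught SIGSEGV" line, so the final
    tuple is the last downtime alert logged before the crash.
    """
    events = []
    for line in lines:
        if "Caught SIGSEGV" in line:
            break
        match = ALERT_RE.search(line.rstrip("\n"))
        if not match:
            continue
        kind = match.group(1)
        fields = match.group(2).split(";")
        if kind == "HOST" and len(fields) >= 2:
            # Layout assumed: host;STATE; message
            events.append((kind, fields[0], None, fields[1]))
        elif kind == "SERVICE" and len(fields) >= 3:
            # Layout assumed: host;service description;STATE; message
            events.append((kind, fields[0], fields[1], fields[2]))
    return events


if __name__ == "__main__":
    # Two lines copied from the excerpt above, followed by the crash line.
    sample = [
        "Apr 12 09:45:09 nagios: SERVICE DOWNTIME ALERT: host2;CPU Usage;STARTED; Service has entered a period of scheduled downtime",
        "Apr 12 09:45:09 nagios: HOST DOWNTIME ALERT: host6;STOPPED; Host has exited from a period of scheduled downtime",
        "Apr 12 09:45:09 nagios: Caught SIGSEGV, shutting down...",
    ]
    for event in parse_downtime_alerts(sample):
        print(event)
```

The last alert logged is not necessarily the entry that triggers the fault, since the crash may happen while the next entry is being processed and before anything is written for it. The list is a starting point for isolating suspects, not a diagnosis.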
I've checked the debug.log file (Level=1) and it looks like it is trying to process the scheduled downtimes:
Code:
[1460428580.088375] [001.0] [pid=15596] create_notification_list_from_host()
[1460428580.088401] [001.0] [pid=15596] should_host_notification_be_escalated()
[1460428580.088418] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088430] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088442] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088456] [001.0] [pid=15596] check_time_against_period()
[1460428580.088490] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088511] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088520] [001.0] [pid=15596] check_time_against_period()
[1460428580.088537] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088556] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088570] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088583] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088593] [001.0] [pid=15596] check_time_against_period()
[1460428580.088614] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088633] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088642] [001.0] [pid=15596] check_time_against_period()
[1460428580.088659] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088677] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088686] [001.0] [pid=15596] check_time_against_period()
[1460428580.088716] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088996] [001.0] [pid=15596] find_downtime()
[1460428580.089165] [001.0] [pid=15596] handle_scheduled_downtime()
[1460428580.117268] [001.0] [pid=15610] clear_volatile_macros_r()
Is there any way I can delete the downtime entries to isolate whether this is causing the issue?
As Nagios is not running, I can't update the nagios.cmd file manually.
This is pretty urgent as this is our Prod site. Any suggestions appreciated.
regards... Fred
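One possible way to test the downtime theory while the daemon is stopped is to strip the downtime entries out of a copy of the state retention file. The sketch below is a hedged illustration, not a confirmed fix. It assumes the downtimes are persisted in the file named by `state_retention_file` in nagios.cfg, as `hostdowntime { ... }` and `servicedowntime { ... }` blocks, each closed by a `}` on its own line. The post does not state where that file lives, and the helper name `strip_downtime_blocks` is hypothetical.

```python
# Block types assumed to hold scheduled downtimes in the retention file.
DOWNTIME_BLOCKS = ("hostdowntime", "servicedowntime")


def strip_downtime_blocks(text):
    """Return (cleaned_text, removed_count) with downtime blocks dropped.

    A block is assumed to start with a line like "hostdowntime {" and
    to end with a line containing only "}". All other content is kept
    unchanged, including line endings.
    """
    kept = []
    removed = 0
    skipping = False
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        if skipping:
            # Drop everything up to and including the closing brace.
            if stripped == "}":
                skipping = False
            continue
        if stripped.endswith("{") and stripped[:-1].strip() in DOWNTIME_BLOCKS:
            skipping = True
            removed += 1
            continue
        kept.append(line)
    return "".join(kept), removed
```

If this approach is tried, it would be safest to run it with Nagios stopped, read from a backup copy of the retention file, write the cleaned text to a new file, and swap the files only after reviewing the difference. Recurring downtimes defined elsewhere may be scheduled again later, so this only helps to isolate the cause.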