Nagios not starting - SIGSEGV error

Post by **Fred Kroeger** » Mon Apr 11, 2016 9:02 pm

This morning I woke up to find the Nagios process had stopped. Looking at the messages log, it appears to start normally, begins processing all the downtimes, and then falls over again.

Apr 12 09:45:09 nagios: SERVICE DOWNTIME ALERT: host1.com.au;ZPOOL Status;STARTED; Service has entered a period of scheduled downtime
Apr 12 09:45:09 nagios: SERVICE DOWNTIME ALERT: host2;CPU Usage;STARTED; Service has entered a period of scheduled downtime
Apr 12 09:45:09 nagios: SERVICE DOWNTIME ALERT: host2;Swap Usage;STARTED; Service has entered a period of scheduled downtime
Apr 12 09:45:09 nagios: HOST DOWNTIME ALERT: host3;STARTED; Host has entered a period of scheduled downtime
Apr 12 09:45:09 nagios: SERVICE DOWNTIME ALERT: host4;/ Disk Free - Linux;STOPPED; Service has exited from a period of scheduled downtime
Apr 12 09:45:09 nagios: HOST DOWNTIME ALERT: host6;STOPPED; Host has exited from a period of scheduled downtime
Apr 12 09:45:09 nagios: Caught SIGSEGV, shutting down...

Looking at the Scheduled Downtime screen, all of these were scheduled for downtime starting at 04:30 (when it all stopped working) and ending at 05:30. They are recurring downtimes.

I've checked the debug.log file (Level=1) and it looks like it is trying to process the scheduled downtimes:

[1460428580.088375] [001.0] [pid=15596] create_notification_list_from_host()
[1460428580.088401] [001.0] [pid=15596] should_host_notification_be_escalated()
[1460428580.088418] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088430] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088442] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088456] [001.0] [pid=15596] check_time_against_period()
[1460428580.088490] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088511] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088520] [001.0] [pid=15596] check_time_against_period()
[1460428580.088537] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088556] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088570] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088583] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088593] [001.0] [pid=15596] check_time_against_period()
[1460428580.088614] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088633] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088642] [001.0] [pid=15596] check_time_against_period()
[1460428580.088659] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088677] [001.0] [pid=15596] check_contact_host_notification_viability()
[1460428580.088686] [001.0] [pid=15596] check_time_against_period()
[1460428580.088716] [001.0] [pid=15596] _get_matching_timerange()
[1460428580.088996] [001.0] [pid=15596] find_downtime()
[1460428580.089165] [001.0] [pid=15596] handle_scheduled_downtime()
[1460428580.117268] [001.0] [pid=15610] clear_volatile_macros_r()

Is there any way I can delete the downtime entries to isolate whether they are causing the issue?
As Nagios is not running, I can't write to the nagios.cmd file manually.

This is pretty urgent as this is our Prod site. Any suggestions appreciated.

regards... Fred

Post by **Box293** » Tue Apr 12, 2016 2:52 am

These are all stored in /usr/local/nagios/var/status.dat

Example:

hostdowntime {
        host_name=localhost
        downtime_id=17
        comment_id=35
        entry_time=1460447202
        start_time=1460447189
        flex_downtime_start=0
        end_time=1460454389
        triggered_by=0
        fixed=1
        duration=7200
        is_in_effect=1
        start_notification_sent=1
        author=Nagios Administrator
        comment=down_test
        }

and also a comment:

hostcomment {
        host_name=localhost
        entry_type=2
        comment_id=35
        source=0
        persistent=0
        entry_time=1460447202
        expires=0
        expire_time=0
        author=Nagios Administrator
        comment_data=This host has been scheduled for fixed downtime from 04-12-2016 17:46:29 to 04-12-2016 19:46:29.  Notifications for the host will not be sent out during that time period.
        }
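If you want to strip the downtime entries out by hand while Nagios is stopped, something like the following Python sketch may help. It assumes the file uses the `name {` ... `}` block layout shown above, and that service downtimes live in `servicedowntime` blocks (by analogy with `hostdowntime`; check your own file). The helper names `strip_downtime_blocks` and `clean_file` are hypothetical, and the matching downtime comments are left in place.

```python
import re
import shutil

# Block names to drop. "hostdowntime" is from the example above;
# "servicedowntime" is an assumed equivalent for services.
DOWNTIME_BLOCKS = {"hostdowntime", "servicedowntime"}

BLOCK_START = re.compile(r"^(\w+)\s*\{$")


def strip_downtime_blocks(text):
    """Return text with every downtime block removed, everything else untouched."""
    kept = []
    skipping = False
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        if skipping:
            # A line holding only "}" closes the block being dropped.
            if stripped == "}":
                skipping = False
            continue
        match = BLOCK_START.match(stripped)
        if match and match.group(1) in DOWNTIME_BLOCKS:
            skipping = True
            continue
        kept.append(line)
    return "".join(kept)


def clean_file(path):
    """Back up path to path + '.bak', then rewrite it without downtime blocks."""
    shutil.copy2(path, path + ".bak")
    with open(path, encoding="utf-8", errors="surrogateescape") as fh:
        original = fh.read()
    with open(path, "w", encoding="utf-8", errors="surrogateescape") as fh:
        fh.write(strip_downtime_blocks(original))


# Example, with Nagios stopped (the path is an assumption for your install):
# clean_file("/usr/local/nagios/var/status.dat")
```

Check ownership and permissions on the rewritten file before starting Nagios again, and keep the `.bak` copy until you are sure it starts cleanly.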

I also find that SIGSEGV errors can be related to broker modules such as MK Livestatus and Mod-Gearman.

Post by **Fred Kroeger** » Tue Apr 12, 2016 5:29 am

Thanks Troy
Yes, I did comment out the Gearman broker, but it made no difference.
The issue has been resolved by:
1) Importing the latest backup to the DR server
2) Confirming that everything works OK there
3) Restoring from that latest backup

At this point it still wouldn't start and still had the SIGSEGV error.
I then:

4) Copied the retention.dat from the DR server to Prod
5) Restarted Nagios, and everything started working.

Maybe there is an easier way.... but this worked.

Regards.... Fred
