Nagios XI Services not working

askewdread
Posts: 69
Joined: Wed Nov 16, 2016 4:54 pm

Post by askewdread »

Hi,

For some reason, starting the nagios service through systemctl appears to have broken.

I applied a config earlier and Nagios just stopped working.
I checked the config and it was all OK.
If I start the process via systemctl, I can see it start, all the workers start, and then they all close again.
If I start it manually using '/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg', it works perfectly
until I make another config commit, and then it dies again.

If I run systemctl status nagios while it's broken, it says it's running, but /etc/init.d/nagios status says it is not running.

Does anyone have any clue what could cause that?
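
One way to see which of the two status reports is right is to look at the PID recorded in the lock file directly. The sketch below is a hypothetical helper, not part of Nagios or Nagios XI. It assumes the first line of the lock file holds the daemon's PID, and that it is run as root or as the nagios user so that `kill -0` is allowed to probe the process. A possible but unconfirmed reason for the mismatch is that systemd only tracks whether the init script exited cleanly, while the init script checks the PID in the lock file.

```shell
#!/bin/sh
# Hypothetical helper: report whether the PID stored on the first line
# of a lock file belongs to a live process.
# Return codes: 0 = running, 1 = not running, 3 = no usable lock file.
lock_pid_alive() {
    lockfile=$1
    if [ ! -f "$lockfile" ]; then
        echo "no lock file"
        return 3
    fi
    pid=$(head -n 1 "$lockfile")
    # An empty or non-numeric first line means there is no PID to check.
    case $pid in
        ''|*[!0-9]*)
            echo "lock file has no PID"
            return 3
            ;;
    esac
    # kill -0 sends no signal; it only tests whether the process can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid is running"
    else
        echo "pid $pid is not running"
        return 1
    fi
}
```

Usage, assuming the default lock file location: `lock_pid_alive /usr/local/nagios/var/nagios.lock`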
askewdread
Posts: 69
Joined: Wed Nov 16, 2016 4:54 pm

Re: Nagios XI Services not working

Post by askewdread »

Hmm, after looking at the debug log I could see it was dying at the downtime part of loading. I found I had 200 downtimes queued up from a script I was working on earlier, so I have cleaned those up and now it seems to start OK.

It seems kind of bad if it crashes just because of that. Hopefully it doesn't recur :)
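
A quick way to check how many downtimes are queued is to count the downtime entries in the retention file. This is only a sketch: it assumes each entry opens with a line starting `hostdowntime {` or `servicedowntime {`, and `count_downtimes` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count downtime entries in a Nagios retention file.
# Assumption: every entry starts with "hostdowntime {" or "servicedowntime {".
count_downtimes() {
    # grep -c prints the count; "|| true" hides its non-zero exit status
    # when the count is 0.
    grep -c -E '^[[:space:]]*(host|service)downtime[[:space:]]*[{]' "$1" || true
}
```

Usage, assuming the default retention file location: `count_downtimes /usr/local/nagios/var/retention.dat`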
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: Nagios XI Services not working

Post by dwhitfield »

We can leave this open in case it reoccurs.

If it does reoccur, we're probably going to want to look at your profile. *If it does reoccur* can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the Download Profile button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info).

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.

UPDATE: Profile received and shared with techs.
askewdread
Posts: 69
Joined: Wed Nov 16, 2016 4:54 pm

Re: Nagios XI Services not working

Post by askewdread »

Hi, this appears to be happening again. Every time I try to start it manually I get a segmentation fault.

Logs below.

/usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg

Code:

Nagios 4.2.4 starting... (PID=45615)
Local time is Tue Jan 24 07:40:37 NZDT 2017
nerd: Channel hostchecks registered successfully
nerd: Channel servicechecks registered successfully
nerd: Channel opathchecks registered successfully
nerd: Fully initialized and ready to rock!
wproc: Successfully registered manager as @wproc with query handler
wproc: Registry request: name=Core Worker 45616;pid=45616
wproc: Registry request: name=Core Worker 45617;pid=45617
wproc: Registry request: name=Core Worker 45618;pid=45618
wproc: Registry request: name=Core Worker 45619;pid=45619
wproc: Registry request: name=Core Worker 45620;pid=45620
wproc: Registry request: name=Core Worker 45621;pid=45621
wproc: Registry request: name=Core Worker 45624;pid=45624
wproc: Registry request: name=Core Worker 45623;pid=45623
wproc: Registry request: name=Core Worker 45622;pid=45622
wproc: Registry request: name=Core Worker 45626;pid=45626
wproc: Registry request: name=Core Worker 45625;pid=45625
wproc: Registry request: name=Core Worker 45627;pid=45627
Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
Segmentation fault


/var/log/messages

Code:

Jan 24 07:40:38 dnzbsglnx10 kernel: nagios[45615]: segfault at 201e000 ip 00007f93059f3f8f sp 00007ffe421c0d48 error 6 in libc-2.17.so[7f930596f000+1b6000]

nagios.debug

Code:

[1485197203.173397] [001.0] [pid=47827] get_next_host_notification_time()
[1485197203.180189] [001.0] [pid=47827] get_next_host_notification_time()
[1485197203.181502] [001.0] [pid=47827] get_next_host_notification_time()
[1485197203.181936] [001.0] [pid=47827] get_next_host_notification_time()
[1485197203.236109] [001.0] [pid=47827] get_next_service_notification_time()
[1485197203.249485] [001.0] [pid=47827] get_next_service_notification_time()
[1485197203.351490] [001.0] [pid=47827] get_next_service_notification_time()
[1485197203.356700] [001.0] [pid=47827] get_next_service_notification_time()
[1485197203.360962] [001.0] [pid=47827] get_next_service_notification_time()
[1485197203.364135] [001.0] [pid=47827] sort_downtime()
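
For reference, the kernel line in /var/log/messages can be decoded a little further. Error code 6 on x86 means a user-mode write to a page that is not mapped, and subtracting the libc mapping base (the first number in the square brackets) from the faulting instruction pointer gives the offset inside libc-2.17.so. The sketch below does that arithmetic; `segfault_offset` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: offset of the faulting instruction inside a mapped library.
# $1 = ip value from the kernel segfault line (hex, no 0x prefix)
# $2 = mapping base from the bracketed part of the same line (hex, no 0x prefix)
segfault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

# Values copied from the /var/log/messages line above.
segfault_offset 00007f93059f3f8f 7f930596f000   # prints 0x84f8f
```

With debug symbols for libc installed, an offset like this could be passed to a tool such as addr2line to find the function involved. Which function that is cannot be told from the log alone.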
askewdread
Posts: 69
Joined: Wed Nov 16, 2016 4:54 pm

Re: Nagios XI Services not working

Post by askewdread »

Renaming retention.dat has allowed it to start manually using /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg, but systemctl start nagios doesn't work at all.
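
When setting the file aside like this, it may help to keep each bad copy under a timestamped name so it can be inspected or sent to support later. A minimal sketch, where `set_aside_retention` and the `.bad.<timestamp>` suffix are hypothetical choices and Nagios is assumed to be stopped first:

```shell
#!/bin/sh
# Hypothetical helper: move a retention file out of the way under a
# timestamped name and print the new name. Run with Nagios stopped.
set_aside_retention() {
    src=$1
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    dest="$src.bad.$(date +%Y%m%d%H%M%S)"
    mv "$src" "$dest" && echo "$dest"
}
```

Usage, assuming the default retention file location: `set_aside_retention /usr/local/nagios/var/retention.dat`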
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Nagios XI Services not working

Post by tgriep »

Could you post this file so we can check whether it is corrupted in any way?

Code:

/etc/init.d/nagios
askewdread
Posts: 69
Joined: Wed Nov 16, 2016 4:54 pm

Re: Nagios XI Services not working

Post by askewdread »

tgriep wrote: Could you post this file so we can check whether it is corrupted in any way?

Code:

/etc/init.d/nagios

See below. Usually the service starts fine; it just randomly stops working.

Code:

#!/bin/sh
#
# chkconfig: 345 99 01
# description: Nagios network monitor
# processname: nagios
# File : nagios
#
# Author : Jorge Sanchez Aymar ([email protected])
#
# Changelog :
#
# 1999-07-09 Karl DeBisschop <[email protected]>
#  - setup for autoconf
#  - add reload function
# 1999-08-06 Ethan Galstad <[email protected]>
#  - Added configuration info for use with RedHat's chkconfig tool
#    per Fran Boon's suggestion
# 1999-08-13 Jim Popovitch <[email protected]>
#  - added variable for nagios/var directory
#  - cd into nagios/var directory before creating tmp files on startup
# 1999-08-16 Ethan Galstad <[email protected]>
#  - Added test for rc.d directory as suggested by Karl DeBisschop
# 2000-07-23 Karl DeBisschop <[email protected]>
#  - Clean out redhat macros and other dependencies
# 2003-01-11 Ethan Galstad <[email protected]>
#  - Updated su syntax (Gary Miller)
#
# Description: Starts and stops the Nagios monitor
#              used to provide network services status.
#
### BEGIN INIT INFO
# Provides:		nagios
# Required-Start:	$local_fs $syslog $network
# Required-Stop:	$local_fs $syslog $network
# Short-Description:	Starts and stops the Nagios monitoring server
# Description:		Starts and stops the Nagios monitoring server
### END INIT INFO

# Our install-time configuration.
prefix=/usr/local/nagios
exec_prefix=${prefix}
NagiosBin=${exec_prefix}/bin/nagios
NagiosCfgFile=${prefix}/etc/nagios.cfg
NagiosCfgtestFile=${prefix}/var/nagios.configtest
NagiosStatusFile=${prefix}/var/status.dat
NagiosRetentionFile=${prefix}/var/retention.dat
NagiosCommandFile=${prefix}/var/rw/nagios.cmd
NagiosVarDir=${prefix}/var
NagiosRunFile=${prefix}/var/nagios.lock
NagiosLockDir=/usr/local/nagiosxi/var/subsys
#NagiosLockDir=/var/lock/subsys
NagiosLockFile=nagios
NagiosCGIDir=${exec_prefix}/sbin
NagiosUser=nagios
NagiosGroup=nagios
checkconfig="true"

# Source function library
# Some *nix do not have an rc.d directory, so do a test first
if [ -f /etc/rc.d/init.d/functions ]; then
	. /etc/rc.d/init.d/functions
elif [ -f /etc/init.d/functions ]; then
	. /etc/init.d/functions
elif [ -f /lib/lsb/init-functions ]; then
	. /lib/lsb/init-functions
fi

# Load any extra environment variables for Nagios and its plugins.
if test -f /etc/sysconfig/nagios; then
	. /etc/sysconfig/nagios
fi

# Automate addition of RAMDISK based on environment variables
USE_RAMDISK=${USE_RAMDISK:-0}
if test "$USE_RAMDISK" -ne 0 && test "$RAMDISK_SIZE"X != "X"; then
	ramdisk=`mount |grep "${RAMDISK_DIR} type tmpfs"`
	if [ "$ramdisk"X == "X" ]; then
		mkdir -p -m 0755 ${RAMDISK_DIR}
		mount -t tmpfs -o size=${RAMDISK_SIZE}m tmpfs ${RAMDISK_DIR}
		mkdir -p -m 0755 ${RAMDISK_DIR}/checkresults
		chown -R $NagiosUser:$NagiosGroup ${RAMDISK_DIR}
	fi
fi


check_config ()
{
	TMPFILE=$(mktemp /tmp/.configtest.XXXXXXXX)
	$NagiosBin -vp $NagiosCfgFile > "$TMPFILE"
	WARN=`grep ^"Total Warnings:" "$TMPFILE" |awk -F: '{print \$2}' |sed s/' '//g`
	ERR=`grep ^"Total Errors:" "$TMPFILE" |awk -F: '{print \$2}' |sed s/' '//g`

	if test "$WARN" = "0" && test "${ERR}" = "0"; then
		echo "OK - Configuration check verified" > $NagiosCfgtestFile
		chmod 0644 $NagiosCfgtestFile
		chown $NagiosUser:$NagiosGroup $NagiosCfgtestFile
		/bin/rm "$TMPFILE"
		return 0
	elif test "${ERR}" = "0"; then
		# Write the errors to a file we can have a script watching for.
		echo "WARNING: Warnings in config files - see log for details: $NagiosCfgtestFile" > $NagiosCfgtestFile
		egrep -i "(^warning|^error)" "$TMPFILE" >> $NagiosCfgtestFile
		chmod 0644 $NagiosCfgtestFile
		chown $NagiosUser:$NagiosGroup $NagiosCfgtestFile
		/bin/rm "$TMPFILE"
		return 0
	else
		# Write the errors to a file we can have a script watching for.
		echo "ERROR: Errors in config files - see log for details: $NagiosCfgtestFile" > $NagiosCfgtestFile
		egrep -i "(^warning|^error)" "$TMPFILE" >> $NagiosCfgtestFile
		chmod 0644 $NagiosCfgtestFile
		chown $NagiosUser:$NagiosGroup $NagiosCfgtestFile
		cat "$TMPFILE"
		exit 8
	fi
}


status_nagios ()
{
	if test -x $NagiosCGI/daemonchk.cgi; then
		if $NagiosCGI/daemonchk.cgi -l $NagiosRunFile > /dev/null 2>&1; then return 0; fi
	else
		if ps -p $NagiosPID > /dev/null 2>&1; then return 0; fi
	fi

	return 1
}

printstatus_nagios ()
{
	if status_nagios; then
		echo "nagios (pid $NagiosPID) is running..."
	else
		echo "nagios is not running"
        exit 3
	fi
}

killproc_nagios ()
{
	kill -s "$1" $NagiosPID
}

pid_nagios ()
{
	if test ! -f $NagiosRunFile; then
		echo "No lock file found in $NagiosRunFile"
		exit 3
	fi

	NagiosPID=`head -n 1 $NagiosRunFile`
}



# Check that nagios exists.
if [ ! -f $NagiosBin ]; then
    echo "Executable file $NagiosBin not found. Exiting."
    exit 1
fi

# Check that nagios.cfg exists.
if [ ! -f $NagiosCfgFile ]; then
    echo "Configuration file $NagiosCfgFile not found. Exiting."
    exit 1
fi

# See how we were called.
case "$1" in

	start)
		echo -n "Starting nagios:"

		if test "$checkconfig" = "true"; then
			check_config
			# check_config exits on configuration errors.
		fi

		if test -f $NagiosRunFile; then
			NagiosPID=`head -n 1 $NagiosRunFile`
			if status_nagios; then
				echo " another instance of nagios is already running."
				exit 0
			fi
		fi

		touch $NagiosVarDir/nagios.log $NagiosRetentionFile
		rm -f $NagiosCommandFile
		touch $NagiosRunFile
		chown $NagiosUser:$NagiosGroup $NagiosRunFile $NagiosVarDir/nagios.log $NagiosRetentionFile
		USER=$NagiosUser G_BROKEN_FILENAMES=1 SSH_TTY=/dev/pts/0 $NagiosBin -d $NagiosCfgFile
		if [ -d $NagiosLockDir ]; then touch $NagiosLockDir/$NagiosLockFile; fi
			
		service snmptt restart &>/dev/null ||:

		echo " done."
		;;

	stop)
		echo -n "Stopping nagios:"

		pid_nagios
		killproc_nagios TERM

		# now we have to wait for nagios to exit and remove its
		# own NagiosRunFile, otherwise a following "start" could
		# happen, and then the exiting nagios will remove the
		# new NagiosRunFile, allowing multiple nagios daemons
		# to (sooner or later) run - John Sellens
		#echo -n 'Waiting for nagios to exit .'
		for i in 1 2 3 4 5 6 7 8 9 10 ; do
			if status_nagios > /dev/null; then
				echo -n '.'
				sleep 1
			else
				break
			fi
		done
		if status_nagios > /dev/null; then
			echo ''
			echo 'Warning - nagios did not exit in a timely manner'
		else
			echo ' done.'
		fi

		rm -f $NagiosStatusFile $NagiosRunFile $NagiosLockDir/$NagiosLockFile $NagiosCommandFile
		;;

	status)
		pid_nagios
		printstatus_nagios
		;;

	checkconfig)
		if test "$checkconfig" = "true"; then
			printf "Running configuration check...\n"
			check_config
		fi

		if [ $? -eq 0 ]; then
			echo " OK."
		else
			echo " CONFIG ERROR!  Check your Nagios configuration."
			exit 1
		fi
		;;

	restart)
		if test "$checkconfig" = "true"; then
			printf "Running configuration check...\n"
			check_config
		fi

		$0 stop
		$0 start
		;;

	reload|force-reload)
		if test "$checkconfig" = "true"; then
			printf "Running configuration check...\n"
			check_config
		fi

		if test ! -f $NagiosRunFile; then
			$0 start
		else
			pid_nagios
			if status_nagios > /dev/null; then
				printf "Reloading nagios configuration...\n"
				killproc_nagios HUP
				echo "done"
			else
				$0 stop
				$0 start
			fi
		fi
		;;

	configtest)
		$NagiosBin -vp $NagiosCfgFile
		;;

	*)
		echo "Usage: nagios {start|stop|restart|reload|force-reload|status|checkconfig|configtest}"
		exit 1
		;;

esac

# End of this script

askewdread
Posts: 69
Joined: Wed Nov 16, 2016 4:54 pm

Re: Nagios XI Services not working

Post by askewdread »

It just happened again on a config commit, and the only way to get it to start again was to rename retention.dat.
Sadly, that discards any service acknowledgements and, of course, loses all history.
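
A less drastic option than discarding the whole file might be to strip only the downtime entries from a copy of retention.dat and start Nagios with that. This is an untested sketch, not a confirmed fix: the earlier cleanup of queued downtimes and the `sort_downtime()` line at the end of the debug log only suggest that the downtime entries are the trigger. It assumes each entry opens with `hostdowntime {` or `servicedowntime {` and closes with a line containing only `}`. `strip_downtimes` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print a retention file with all downtime blocks removed.
# Assumption: blocks open with "hostdowntime {" or "servicedowntime {"
# and end at the next line that starts with "}".
strip_downtimes() {
    awk '
        /^[ \t]*(host|service)downtime[ \t]*[{]/ { skip = 1; next }
        skip && /^[ \t]*[}]/                      { skip = 0; next }
        !skip                                     { print }
    ' "$1"
}
```

Usage with Nagios stopped, writing to a new file rather than editing in place: `strip_downtimes retention.dat.bad > retention.dat.cleaned`. The input file name here is a placeholder for whichever bad copy was kept.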
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Nagios XI Services not working

Post by tgriep »

The Nagios init script looks fine, thanks for posting it.
Can you post or PM me a bad retention.dat file from your server that causes the segmentation fault so we can view it?
askewdread
Posts: 69
Joined: Wed Nov 16, 2016 4:54 pm

Re: Nagios XI Services not working

Post by askewdread »

Good timing, it just did it again ;)
I have PM'd you the retention.dat file.