Nagios Support Forum

Posted: **Thu Jan 16, 2014 3:22 am**

Hi sreinhardt,

Good to hear you think this might work as a work-around; I am getting my hopes up.

I desperately want to get the cluster stabilized! Any fix will do.

So where do you suggest inserting the line "export MALLOC_CHECK_=0" in /etc/init.d/nagios? Does the position matter? This is the file as it stands now; maybe you can point out the most logical spot. I would think right after the comments and before the first if statement.

Code:

[root@xxx]# cat /etc/init.d/nagios

#!/bin/sh
# 
# chkconfig: 345 99 01
# description: Nagios network monitor
#
# File : nagios
#
# Author : Jorge Sanchez Aymar ([email protected])
# 
# Changelog :
#
# 1999-07-09 Karl DeBisschop <[email protected]>
#  - setup for autoconf
#  - add reload function
# 1999-08-06 Ethan Galstad <[email protected]>
#  - Added configuration info for use with RedHat's chkconfig tool
#    per Fran Boon's suggestion
# 1999-08-13 Jim Popovitch <[email protected]>
#  - added variable for nagios/var directory
#  - cd into nagios/var directory before creating tmp files on startup
# 1999-08-16 Ethan Galstad <[email protected]>
#  - Added test for rc.d directory as suggested by Karl DeBisschop
# 2000-07-23 Karl DeBisschop <[email protected]>
#  - Clean out redhat macros and other dependencies
# 2003-01-11 Ethan Galstad <[email protected]>
#  - Updated su syntax (Gary Miller)
#
# Description: Starts and stops the Nagios monitor
#              used to provide network services status.
#

# Load any extra environment variables for Nagios and its plugins
if test -f /etc/sysconfig/nagios; then
        . /etc/sysconfig/nagios
fi

status_nagios ()
{

        if test -x $NagiosCGI/daemonchk.cgi; then
                if $NagiosCGI/daemonchk.cgi -l $NagiosRunFile; then
                        return 0
                else
                        return 1
                fi
        else
                if ps -p $NagiosPID > /dev/null 2>&1; then
                        return 0
                else
                        return 1
                fi
        fi

        return 1
}


printstatus_nagios()
{

        if status_nagios $1 $2; then
                echo "nagios (pid $NagiosPID) is running..."
        else
                echo "nagios is not running"
        fi
}


killproc_nagios ()
{

        kill $2 $NagiosPID

}


pid_nagios ()
{

        if test ! -f $NagiosRunFile; then
                echo "No lock file found in $NagiosRunFile"
                exit 1
        fi

        NagiosPID=`head -n 1 $NagiosRunFile`
}


# Source function library
# Solaris doesn't have an rc.d directory, so do a test first
if [ -f /etc/rc.d/init.d/functions ]; then
        . /etc/rc.d/init.d/functions
elif [ -f /etc/init.d/functions ]; then
        . /etc/init.d/functions
fi

prefix=/etc/nagios
exec_prefix=${prefix}
NagiosBin=/usr/sbin/nagios
NagiosCfgFile=/etc/nagios/nagios.cfg
NagiosStatusFile=/var/log/nagios/status.dat
NagiosRetentionFile=/var/log/nagios/retention.dat
NagiosCommandFile=/var/log/nagios/rw/nagios.cmd
NagiosVarDir=/var/log/nagios
NagiosRunFile=/var/log/nagios/nagios.lock
NagiosLockDir=/var/lock/subsys
NagiosLockFile=nagios
NagiosCGIDir=/usr/lib/nagios/cgi
NagiosUser=nagios
NagiosGroup=nagios
          

# Check that nagios exists.
if [ ! -f $NagiosBin ]; then
    echo "Executable file $NagiosBin not found.  Exiting."
    exit 1
fi

# Check that nagios.cfg exists.
if [ ! -f $NagiosCfgFile ]; then
    echo "Configuration file $NagiosCfgFile not found.  Exiting."
    exit 1
fi
          
# See how we were called.
case "$1" in

        start)
                echo -n "Starting nagios:"
                $NagiosBin -v $NagiosCfgFile > /dev/null 2>&1;
                if [ $? -eq 0 ]; then
                        su - $NagiosUser -c "touch $NagiosVarDir/nagios.log $NagiosRetentionFile"
                        rm -f $NagiosCommandFile
                        touch $NagiosRunFile
                        chown $NagiosUser:$NagiosGroup $NagiosRunFile
                        $NagiosBin -d $NagiosCfgFile
                        if [ -d $NagiosLockDir ]; then touch $NagiosLockDir/$NagiosLockFile; fi
                        echo " done."
                        exit 0
                else
                        echo "CONFIG ERROR!  Start aborted.  Check your Nagios configuration."
                        exit 1
                fi
                ;;

        stop)
                echo -n "Stopping nagios: "

                pid_nagios
                killproc_nagios nagios

                # now we have to wait for nagios to exit and remove its
                # own NagiosRunFile, otherwise a following "start" could
                # happen, and then the exiting nagios will remove the
                # new NagiosRunFile, allowing multiple nagios daemons
                # to (sooner or later) run - John Sellens
                #echo -n 'Waiting for nagios to exit .'
                for i in 1 2 3 4 5 6 7 8 9 10 ; do
                    if status_nagios > /dev/null; then
                        echo -n '.'
                        sleep 1
                    else
                        break
                    fi
                done
                if status_nagios > /dev/null; then
                    echo ''
                    echo 'Warning - nagios did not exit in a timely manner'
                else
                    echo 'done.'
                fi

                rm -f $NagiosStatusFile $NagiosRunFile $NagiosLockDir/$NagiosLockFile $NagiosCommandFile
                ;;

        status)
                pid_nagios
                printstatus_nagios nagios
                ;;

        checkconfig)
                printf "Running configuration check..."
                $NagiosBin -v $NagiosCfgFile > /dev/null 2>&1;
                if [ $? -eq 0 ]; then
                        echo " OK."
                else
                        echo " CONFIG ERROR!  Check your Nagios configuration."
                        exit 1
                fi
                ;;

        restart)
                printf "Running configuration check..."
                $NagiosBin -v $NagiosCfgFile > /dev/null 2>&1;
                if [ $? -eq 0 ]; then
                        echo "done."
                        $0 stop
                        $0 start
                else
                        echo " CONFIG ERROR!  Restart aborted.  Check your Nagios configuration."
                        exit 1
                fi
                ;;

        reload|force-reload)
                printf "Running configuration check..."
                $NagiosBin -v $NagiosCfgFile > /dev/null 2>&1;
                if [ $? -eq 0 ]; then
                        echo "done."
                        if test ! -f $NagiosRunFile; then
                                $0 start
                        else
                                pid_nagios
                                if status_nagios > /dev/null; then
                                        printf "Reloading nagios configuration..."
                                        killproc_nagios nagios -HUP
                                        echo "done"
                                else
                                        $0 stop
                                        $0 start
                                fi
                        fi
                else
                        echo " CONFIG ERROR!  Reload aborted.  Check your Nagios configuration."
                        exit 1
                fi
                ;;

        *)
                echo "Usage: nagios {start|stop|restart|reload|force-reload|status|checkconfig}"
                exit 1
                ;;

esac
  
# End of this script

To answer your question, here's the info:

Code:

[root@xxx]# uname -a
Linux padm122 2.6.32-358.23.2.el6.x86_64 #1 SMP Sat Sep 14 05:32:37 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

[root@xxx]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.4 (Santiago)

Posted: **Thu Jan 16, 2014 4:00 am**

P.S. I see this exact same behavior in the strace output on every crash. It really looks like a pattern.

Besides strace, I do not know of any other debugging method. Does anyone have advice on an equivalent tool? I run strace without any specific parameters, only the PID.
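
For reference, attaching strace with a few extra options usually gives more context than the bare PID does. The following is only a minimal sketch: it assumes the lock file from the init script above holds the daemon's PID, and the helper name `build_strace_cmd` and the output path `/tmp/nagios.strace` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the strace command line for a given PID.
#   -f      follow child processes created by fork/clone
#   -tt     prefix each line with a timestamp including microseconds
#   -s 256  print up to 256 bytes of each string argument
#   -o FILE write the trace to FILE instead of stderr
#   -p PID  attach to an already running process
build_strace_cmd() {
    echo "strace -f -tt -s 256 -o $2 -p $1"
}

# Assumption: PID file location taken from NagiosRunFile in the init script.
pid_file=/var/log/nagios/nagios.lock
if [ -r "$pid_file" ]; then
    nagios_pid=$(head -n 1 "$pid_file")
else
    nagios_pid=12345   # placeholder so the sketch runs without Nagios present
fi

strace_cmd=$(build_strace_cmd "$nagios_pid" /tmp/nagios.strace)
echo "Run as root: $strace_cmd"
```

With -f on a busy Nagios host the trace file can grow quickly, so it is worth keeping an eye on free disk space while it runs.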

Posted: **Thu Jan 16, 2014 1:09 pm**

I would make the change in the start portion of the init script:

Code:

# See how we were called.
case "$1" in

        start)
                echo -n "Starting nagios:"
                $NagiosBin -v $NagiosCfgFile > /dev/null 2>&1;
                if [ $? -eq 0 ]; then
                        su - $NagiosUser -c "touch $NagiosVarDir/nagios.log $NagiosRetentionFile"
                        rm -f $NagiosCommandFile
                        touch $NagiosRunFile
                        chown $NagiosUser:$NagiosGroup $NagiosRunFile
                        $NagiosBin -d $NagiosCfgFile
                        if [ -d $NagiosLockDir ]; then touch $NagiosLockDir/$NagiosLockFile; fi
                        echo " done."
                        exit 0
                else
                        echo "CONFIG ERROR!  Start aborted.  Check your Nagios configuration."
                        exit 1
                fi
                ;;

to:

Code:

# See how we were called.
case "$1" in

        start)
                echo -n "Starting nagios:"
                $NagiosBin -v $NagiosCfgFile > /dev/null 2>&1;
                if [ $? -eq 0 ]; then
                        su - $NagiosUser -c "touch $NagiosVarDir/nagios.log $NagiosRetentionFile"
                        rm -f $NagiosCommandFile
                        touch $NagiosRunFile
                        chown $NagiosUser:$NagiosGroup $NagiosRunFile
#### Change here ####
                        MALLOC_CHECK_=0 $NagiosBin -d $NagiosCfgFile
#### End change ####
                        if [ -d $NagiosLockDir ]; then touch $NagiosLockDir/$NagiosLockFile; fi
                        echo " done."
                        exit 0
                else
                        echo "CONFIG ERROR!  Start aborted.  Check your Nagios configuration."
                        exit 1
                fi
                ;;
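
To illustrate how the shell scopes such a variable, here is a small sketch comparing an assignment placed in front of a command with a separate `export` line. It uses `sh -c` as a stand-in for the Nagios binary, so it only demonstrates the shell behavior and is not a test of Nagios itself. Note that `export` only takes names and assignments as arguments, so the command to run belongs on its own line after it.

```shell
#!/bin/sh
unset MALLOC_CHECK_

# Form 1: assignment in front of a command. Only that command's
# environment receives the variable; the calling shell is unchanged.
child_value=$(MALLOC_CHECK_=0 sh -c 'echo "${MALLOC_CHECK_:-unset}"')
parent_value=${MALLOC_CHECK_:-unset}
echo "child sees:  $child_value"
echo "parent sees: $parent_value"

# Form 2: export on its own line. Every command started afterwards
# from this shell inherits the variable.
export MALLOC_CHECK_=0
later_child_value=$(sh -c 'echo "${MALLOC_CHECK_:-unset}"')
echo "later child sees: $later_child_value"
```

Either form should reach the daemon when used in the start branch; the first keeps the setting limited to the one command it prefixes.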

If possible, I would love to see a full core dump of when this happens. It might point us to where this is actually happening.

Posted: **Fri Jan 17, 2014 4:10 am**

Thanks a lot for identifying the place and way of putting it with so much clarity.

I will do the required changes and will keep all of you posted on the behaviour of the work-around.

Posted: **Fri Jan 17, 2014 2:37 pm**

We will await your reply.

Posted: **Wed Jan 22, 2014 8:03 am**

I did the suggested change yesterday on the four nodes. So far so good, now wait and see.

About the core dumps, could you please advise on how to go about this? I think Nagios will create one automatically and save it in /tmp? I have never seen such a file created.

This parameter is set in nagios.cfg:

Code:

daemon_dumps_core=1

But what will the location be?

Thanks.

Posted: **Wed Jan 22, 2014 3:48 pm**

I'm not 100% sure; many people claim that it *should* be in "/var/tmp/". I believe you can set it in "httpd.conf". For example, you can put this line in the "Global Environment" section:

Code:

CoreDumpDirectory /tmp

and restart Apache:

Code:

service httpd restart

Posted: **Thu Jan 23, 2014 11:23 am**

Will this affect where Nagios stores its core dumps? I know Nagios is presented through Apache with HTML files and CGI scripts, but it's the Nagios daemon itself that crashes. This looks like a parameter for Apache itself, should it crash, or am I mistaken here?

Posted: **Fri Jan 24, 2014 2:27 pm**

We are looking into this, but it is proving difficult to pin down. We will get back to you as soon as possible about where these are dumped.
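
In the meantime, on Linux it is the kernel's settings rather than the application that decide where a core file lands, so the values below may be worth checking. This is a rough sketch, assuming a Linux /proc filesystem and the lock file path from the init script posted earlier; it only reads settings and changes nothing.

```shell
#!/bin/sh
# Kernel-wide template for core file names. A relative name means the
# file is written to the crashing process's working directory; a leading
# "|" means the dump is piped to a helper program instead of a file.
core_pattern=$(cat /proc/sys/kernel/core_pattern 2>/dev/null || echo unknown)
echo "core_pattern: $core_pattern"

# Core size limit of this shell; 0 means no core file is written.
core_limit=$(ulimit -c)
echo "core limit in this shell: $core_limit"

# Assumption: PID file location taken from NagiosRunFile in the init script.
pid_file=/var/log/nagios/nagios.lock
if [ -r "$pid_file" ]; then
    pid=$(head -n 1 "$pid_file")
    # Limit and working directory of the running daemon (run as root).
    grep 'Max core file size' "/proc/$pid/limits"
    ls -l "/proc/$pid/cwd"
fi
```

If the pattern is a plain relative name such as "core", the daemon's working directory shown by the last command is the first place to look.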

Posted: **Mon Feb 03, 2014 7:44 am**

Hi there,

Setting the environment variable with "export MALLOC_CHECK_=1" did not stop the crashes.

Based on the error log, I made a few more changes, as follows:

Prev: cached_host_check_horizon=15
Now: cached_host_check_horizon=30

Prev: external_command_buffer_slots=4096
Now: external_command_buffer_slots=8192

Prev: cached_service_check_horizon=15
Now: cached_service_check_horizon=30

Prev: #free_child_process_memory=1
Now: free_child_process_memory=0

Prev: #child_processes_fork_twice=1
Now: child_processes_fork_twice=1

ocsp_timeout=10: changed from 5 to 10 seconds.

perfdata_timeout=10: changed from 5 to 10 seconds.

use_large_installation_tweaks=1: changed from disabled to enabled.

But, hard luck, none of those worked. (The changes were not all made at once; they were applied in phases.) So I rolled back all of them, except "export MALLOC_CHECK_=1", as I see no real reason to remove it.

Surprisingly, there is one parameter, "allow_empty_hostgroup_assignment=1", which is listed among the config options at http://nagios.sourceforge.net/docs/nagi ... gmain.html but is missing from the nagios.cfg files on all four nodes. I am not sure why it is not there; was this option introduced in a recent update?

I added this parameter and, so far so good, it has not crashed since last Thursday. I would like to have your view on this. I will keep you all updated on the progress.
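
A quick way to confirm whether a directive is active in a given nagios.cfg is sketched below. The helper name `cfg_value` is made up for illustration, and the sample file only stands in for the real config so that the snippet runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper: print the value of an active (uncommented)
# directive in a Nagios main config file. Prints nothing if the
# directive is absent or commented out; if it appears several times,
# the last occurrence in the file is printed.
cfg_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Small sample file so the sketch runs without a Nagios install.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#free_child_process_memory=1
cached_host_check_horizon=15
allow_empty_hostgroup_assignment=1
EOF

present=$(cfg_value "$sample" allow_empty_hostgroup_assignment)
commented=$(cfg_value "$sample" free_child_process_memory)
echo "allow_empty_hostgroup_assignment: ${present:-not set}"
echo "free_child_process_memory: ${commented:-not set}"
rm -f "$sample"
```

On a real node this would be called as `cfg_value /etc/nagios/nagios.cfg allow_empty_hostgroup_assignment`, using the config path from the init script, and repeated on each of the four nodes to compare them.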

Nagios daemon crashing frequently (extensive logs attached)