Event Handler stopped working! Part 3

Postby Pitone_Maledetto » Thu Jul 26, 2018 3:09 am

This episode is called "at wit's end" :)

Thanks to @scottwilkerson I now understand better how timeperiods work. However, last night, and once before on the 23rd, I was unfortunately back to square one.

This is the timeperiod proposed:

Code: Select all
define timeperiod{
        timeperiod_name     hr-s3bt-nm
        alias               LService
        sunday              22:00-24:00
        monday              00:00-06:00
        monday              22:00-24:00
        tuesday             00:00-06:00
        tuesday             22:00-24:00
        wednesday           00:00-06:00
        wednesday           22:00-24:00
        thursday            00:00-06:00
        thursday            22:00-24:00
        friday              00:00-06:00
        friday              22:00-24:00
        saturday            00:00-06:00
        saturday            22:00-24:00
        sunday              00:00-06:00
        }


But last night, at around ten to one in the morning, I was called, and when I looked at my custom logs I found the following entries:
Code: Select all
[07-25-2018-21:06:29] - LService can't be resumed during working hours. 0
IO manger pid: 49773
IO manger pid: 56727
[07-26-2018-00:46:29] - LService can't be resumed during working hours. 0


The 0 at the end is the value of $ISVALIDTIME:hr-s3bt-nm$ that I captured at that moment. As you can see, it is correct at 07-25-2018-21:06:29 (outside the timeperiod) but not at 07-26-2018-00:46:29, which should fall inside the 00:00-06:00 range.
So I am not sure what's going on or why the macro gets the wrong value.
If anyone could help me with this it would be greatly appreciated.

P.S. The issue seems (maybe) to be with the 00:00-06:00 entries, since I noticed that the service was resumed a couple of times when it went down within the 22:00-24:00 range.
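
As a cross-check on the logged values, here is a minimal sketch of what the macro would be expected to return for the intended overnight window (22:00 until 06:00). The helper name `expected_flag` is hypothetical and not part of Nagios; it only compares a clock time against that window.

```shell
# Hypothetical helper: prints 1 if HH:MM[:SS] falls inside the intended
# overnight window (22:00 up to, but not including, 06:00), else 0.
expected_flag() {
    hh=${1%%:*}          # hour part, e.g. "00"
    rest=${1#*:}         # "46" or "46:29"
    mm=${rest%%:*}       # minute part
    # Drop one leading zero so values like "08" are not parsed as octal.
    hh=${hh#0}; mm=${mm#0}
    minutes=$(( ${hh:-0} * 60 + ${mm:-0} ))
    # 22:00 is minute 1320 of the day, 06:00 is minute 360.
    if [ "$minutes" -ge 1320 ] || [ "$minutes" -lt 360 ]; then
        echo 1
    else
        echo 0
    fi
}

expected_flag 21:06:29   # time of the first log entry
expected_flag 00:46:29   # time of the second log entry
```

Under these assumptions the first entry should log 0 and the second should log 1, which matches the description above: only the 00:46:29 value is unexpected.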

Thank you.
"It is impossible to work in information technology without also engaging in social engineering"
Jaron Lanier
User avatar
Pitone_Maledetto
 
Posts: 62
Joined: Fri Jul 01, 2016 4:11 am
Location: Liverpool, United Kingdom

Re: Event Handler stopped working! Part 3

Postby scottwilkerson » Thu Jul 26, 2018 10:00 am

Can you share your event handler command as well as the script the event handler is running?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
User avatar
scottwilkerson
DevOps Engineer
 
Posts: 11144
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Event Handler stopped working! Part 3

Postby Pitone_Maledetto » Thu Jul 26, 2018 10:46 am

Hi Scott,

This is the script in resume-hr-lservice:

Code: Select all
# Arguments, in the order passed by the command definition:
#   $1 = $SERVICESTATE$
#   $2 = $SERVICESTATETYPE$
#   $3 = $SERVICEATTEMPT$
#   $4 = $HOSTADDRESS$
#   $5 = $ISVALIDTIME:hr-s3bt-nm$

NOW=$(date +"%m-%d-%Y-%T")
TIMESTAMP=$(date +%s)

# What state is the service in?
case "$1" in
OK)
    # The service just came back up, so don't do anything...
    ;;

WARNING)
    # Warning statuses are triggered if LService can't be resumed because for example it has been manually stopped.
    ;;

CRITICAL)
    # Is this a "soft" or a "hard" state?
    case "$2" in

    HARD)
        case "$3" in
        1)
            # Quote the arguments so an empty value does not break the test.
            if [ "$5" = 1 ] ; then
                # Trying to resume LService.
                /usr/local/nagios/libexec/check_by_ssh -t 120 -H "$4" -l nagios -C "sudo /usr/local/bin/start_LService.sh" >> /tmp/resume.txt 2>&1
            elif [ "$5" = 0 ] ; then
                # Timeperiod is invalid.
                echo "[$NOW] - LService can't be resumed during working hours. $5" >> /tmp/resume.txt
            fi
            ;;
        esac
        ;;
    esac
    ;;

UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;

esac
exit 0
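
To test the branching without touching the real service, a dry-run version of the same decision logic can help. This is only a sketch: `handler_action` is a hypothetical name, it takes the state, state type, attempt number and the $ISVALIDTIME flag (the host address is left out), and it prints the action instead of performing it.

```shell
# Hypothetical dry-run helper: mirrors the branch structure of the handler
# above but only prints the action it would take.
# Arguments: state, state type, attempt number, ISVALIDTIME flag.
handler_action() {
    # The handler only acts on the first attempt of a HARD CRITICAL state.
    if [ "$1" != CRITICAL ] || [ "$2" != HARD ] || [ "$3" != 1 ]; then
        echo "ignore"
        return 0
    fi
    case "$4" in
        1) echo "resume" ;;            # inside the timeperiod
        0) echo "log-only" ;;          # outside the timeperiod
        *) echo "unexpected-flag" ;;   # the posted script takes no action here
    esac
}

handler_action CRITICAL HARD 1 1
handler_action CRITICAL HARD 1 0
handler_action CRITICAL SOFT 1 1
handler_action CRITICAL HARD 1 ""
```

The extra `unexpected-flag` branch is an addition for illustration: it makes an empty or malformed flag visible, whereas the original `if`/`elif` pair has no branch for it.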


This is the command:

Code: Select all
define command{
   command_name    resume-hr-lservice
   command_line    /usr/local/nagios/libexec/eventhandlers/resume-hr-lservice  $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$ $ISVALIDTIME:hr-s3bt-nm$
   }


Thank you

Re: Event Handler stopped working! Part 3

Postby Pitone_Maledetto » Thu Jul 26, 2018 11:03 am

In the meantime I have changed the timeperiod to:
Code: Select all
define timeperiod{
    timeperiod_name    hr-s3bt-nm
    alias              LService
    sunday             06:00-22:00
    monday             06:00-22:00
    tuesday            06:00-22:00
    wednesday          06:00-22:00
    thursday           06:00-22:00
    friday             06:00-22:00
    saturday           06:00-22:00
    }


And changed the condition to:

Code: Select all
if [ $5 = 0 ]


Now the handler resumes LService when the value is 0 and does not resume it when it is 1.
So it is the inverse check with a different timeperiod, just to test the theory.
Regards

Re: Event Handler stopped working! Part 3

Postby scottwilkerson » Thu Jul 26, 2018 1:42 pm

Actually, I think this solution looks even more elegant than what I proposed.

Re: Event Handler stopped working! Part 3

Postby Pitone_Maledetto » Mon Aug 06, 2018 9:29 am

Hi @scottwilkerson,
It all seems to be fine now.
It would be interesting to see why the original solution with the long timeperiod did not work as expected.
Anyhow I am happy to close this thread.
Thank you very much for your help.
Regards

Re: Event Handler stopped working! Part 3

Postby scottwilkerson » Mon Aug 06, 2018 11:00 am

Great! Locking.

I'm actually not sure why the other timeperiod I suggested didn't work.
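
One avenue that was not explored in this thread: the Nagios Core documentation describes each weekday directive as a comma-delimited list of time ranges, while the original definition listed every weekday on two separate lines. Whether two lines for the same weekday are merged was not tested here. If only one of them takes effect (an unverified assumption), the 00:46 entry on Thursday would be consistent with only `thursday 22:00-24:00` being active. The sketch below compares the two cases; `to_minutes` and `time_in_ranges` are hypothetical helpers, not Nagios functions.

```shell
# Hypothetical helper: converts HH:MM to minutes since midnight.
to_minutes() {
    h=${1%%:*}; m=${1#*:}
    h=${h#0}; m=${m#0}     # drop one leading zero to avoid octal parsing
    echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# Hypothetical helper: prints 1 if HH:MM lies inside any range of a
# comma-delimited list such as "00:00-06:00,22:00-24:00", else 0.
time_in_ranges() {
    now=$(to_minutes "$1")
    old_ifs=$IFS; IFS=,
    for range in $2; do
        start=$(to_minutes "${range%-*}")
        end=$(to_minutes "${range#*-}")
        if [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]; then
            IFS=$old_ifs; echo 1; return 0
        fi
    done
    IFS=$old_ifs
    echo 0
}

time_in_ranges 00:46 "00:00-06:00,22:00-24:00"   # both ranges on one line
time_in_ranges 00:46 "22:00-24:00"               # only the later line in effect
```

If that assumption holds, writing both ranges on a single weekday line (for example `thursday 00:00-06:00,22:00-24:00`) would be worth trying before concluding that the macro itself misbehaves.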

