Event Handler stopped working! Part 3

Locked
Pitone_Maledetto
Posts: 69
Joined: Fri Jul 01, 2016 4:11 am
Location: Liverpool, United Kingdom

Event Handler stopped working! Part 3

Post by Pitone_Maledetto »

This episode is called "at wit's end" :)

Thanks to @scottwilkerson I now understand better how timeperiods work. Unfortunately, last night, and once before on the 23rd, I was back to square one.

This is the timeperiod proposed:

define timeperiod{
        timeperiod_name     hr-s3bt-nm
        alias               LService
        sunday              22:00-24:00
        monday              00:00-06:00
        monday              22:00-24:00
        tuesday             00:00-06:00
        tuesday             22:00-24:00
        wednesday           00:00-06:00
        wednesday           22:00-24:00
        thursday            00:00-06:00
        thursday            22:00-24:00
        friday              00:00-06:00
        friday              22:00-24:00
        saturday            00:00-06:00
        saturday            22:00-24:00
        sunday              00:00-06:00
        }
But last night, at around ten to one in the morning, I was called out, and when I looked at my custom logs I found the following entries:

[07-25-2018-21:06:29] - LService can't be resumed during working hours. 0
IO manger pid: 49773
IO manger pid: 56727
[07-26-2018-00:46:29] - LService can't be resumed during working hours. 0
The 0 at the end of each entry is the value of $ISVALIDTIME:hr-s3bt-nm$ captured at that moment. As you can see, it is correct at 07-25-2018-21:06:29, which is outside the timeperiod, but not at 07-26-2018-00:46:29, which falls inside the 00:00-06:00 range and should have produced 1.
So I am not sure what is going on or why the macro is given the wrong value.
If anyone could help me with this it would be greatly appreciated.
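
A quick way to cross-check the two log entries is to compute what the value should have been for the intended 22:00-06:00 window. The sketch below is a minimal, hypothetical helper (`expected_isvalid` is not part of Nagios) and assumes two-digit HH:MM input in the server's local time:

```shell
#!/bin/sh
# Hypothetical helper: prints 1 if HH:MM falls in the intended
# 22:00-06:00 window, 0 otherwise. It mirrors the intent of the
# hr-s3bt-nm timeperiod; it does not query Nagios.
expected_isvalid() {
    hh=${1%%:*}
    mm=${1##*:}
    # Strip one leading zero so 08 and 09 are not read as octal.
    minutes=$(( ${hh#0} * 60 + ${mm#0} ))
    # 22:00 is minute 1320 of the day, 06:00 is minute 360.
    if [ "$minutes" -ge 1320 ] || [ "$minutes" -lt 360 ]; then
        echo 1
    else
        echo 0
    fi
}

expected_isvalid 21:06   # prints 0, matching the first log entry
expected_isvalid 00:46   # prints 1, unlike the second log entry
```

Comparing this output with the value the handler logged shows which entries disagree with the intended window.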

P.S. The issue may be with the 00:00-06:00 entries, since I noticed that the service was resumed a couple of times when it went down within the 22:00-24:00 range.
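
One thing that may be worth trying: the Nagios Core object documentation describes each weekday directive as a comma-delimited list of time ranges, so both ranges can sit on a single line per day. Whether a repeated weekday line adds to or replaces an earlier one is not established anywhere in this thread, so treat the following as an untested sketch, not a diagnosis. The function name `print_timeperiod` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical generator: prints a timeperiod definition with exactly
# one directive per weekday, each carrying both ranges as a
# comma-delimited list.
print_timeperiod() {
    printf 'define timeperiod{\n'
    printf '        %-19s %s\n' timeperiod_name hr-s3bt-nm
    printf '        %-19s %s\n' alias LService
    for day in sunday monday tuesday wednesday thursday friday saturday; do
        printf '        %-19s %s\n' "$day" '00:00-06:00,22:00-24:00'
    done
    printf '        }\n'
}

print_timeperiod
```

The output can be pasted into the object configuration file and checked with the usual configuration verification before restarting.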

Thank you.
"It is impossible to work in information technology without also engaging in social engineering"
Jaron Lanier
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Event Handler stopped working! Part 3

Post by scottwilkerson »

Can you share your event handler command as well as the script the event handler is running?
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
Pitone_Maledetto

Re: Event Handler stopped working! Part 3

Post by Pitone_Maledetto »

Hi Scott,

This is the script in resume-hr-lservice:

#!/bin/sh
# Arguments passed by the Nagios command definition:
#   $1 = $SERVICESTATE$
#   $2 = $SERVICESTATETYPE$
#   $3 = $SERVICEATTEMPT$
#   $4 = $HOSTADDRESS$
#   $5 = $ISVALIDTIME:hr-s3bt-nm$ (1 = inside the timeperiod, 0 = outside)

NOW=$(date +"%m-%d-%Y-%T")
TIMESTAMP=$(date +%s)

# What state is the service in?
case "$1" in
OK)
    # The service just came back up, so don't do anything...
    ;;

WARNING)
    # Warning statuses are triggered if LService can't be resumed, for example because it has been manually stopped.
    ;;

CRITICAL)
    # Is this a "soft" or a "hard" state?
    case "$2" in

    HARD)
        case "$3" in
        1)
            # Quote "$5" so an empty macro value does not break the test.
            if [ "$5" = 1 ] ; then
                # Trying to resume LService.
                /usr/local/nagios/libexec/check_by_ssh -t 120 -H "$4" -l nagios -C "sudo /usr/local/bin/start_LService.sh" >> /tmp/resume.txt 2>&1
            elif [ "$5" = 0 ] ; then
                # Timeperiod is invalid.
                echo "[$NOW] - LService can't be resumed during working hours. $5" >> /tmp/resume.txt
            fi
            ;;
        esac
        ;;
    esac
    ;;

UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;

esac
exit 0
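
To test the branching without touching the remote host, the decision can be pulled out into a function that only names the action. This is a minimal sketch under that assumption; `decide_action` and its action names are hypothetical and are not part of the script above. It also makes visible the case the script silently ignores, where the fifth argument is neither 0 nor 1:

```shell
#!/bin/sh
# Hypothetical helper: maps the handler's arguments to an action name
# instead of running anything, so the logic can be exercised offline.
#   $1 = $SERVICESTATE$     $2 = $SERVICESTATETYPE$
#   $3 = $SERVICEATTEMPT$   $4 = value of $ISVALIDTIME:hr-s3bt-nm$
decide_action() {
    # Only the first hard CRITICAL attempt is handled at all.
    if [ "$1" != "CRITICAL" ] || [ "$2" != "HARD" ] || [ "$3" != "1" ]; then
        echo ignore
        return 0
    fi
    case "$4" in
        1) echo resume ;;       # inside the timeperiod
        0) echo blocked ;;      # outside the timeperiod
        *) echo unexpected ;;   # empty or unexpanded macro value
    esac
}

decide_action CRITICAL HARD 1 1    # prints resume
decide_action CRITICAL HARD 1 0    # prints blocked
decide_action CRITICAL SOFT 1 1    # prints ignore
decide_action CRITICAL HARD 1 ""   # prints unexpected
```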
This is the command:

define command{
   command_name    resume-hr-lservice
   command_line    /usr/local/nagios/libexec/eventhandlers/resume-hr-lservice  $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$ $ISVALIDTIME:hr-s3bt-nm$
   }
Thank you
Pitone_Maledetto

Re: Event Handler stopped working! Part 3

Post by Pitone_Maledetto »

In the meantime I have changed the timeperiod to:

define timeperiod{
    timeperiod_name    hr-s3bt-nm
    alias              LService
    sunday             06:00-22:00
    monday             06:00-22:00
    tuesday            06:00-22:00
    wednesday          06:00-22:00
    thursday           06:00-22:00
    friday             06:00-22:00
    saturday           06:00-22:00
    }
And changed the test in the script to:

if [ $5 = 0 ]
so that LService is resumed when the value is 0 and left alone when it is 1.
This is the inverse logic with a different timeperiod, just to test the theory.
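
The inverted approach can be sketched offline as follows. Both function names are hypothetical, the window check only imitates the 06:00-22:00 timeperiod and does not query Nagios, and two-digit HH:MM input is assumed:

```shell
#!/bin/sh
# Hypothetical helper: prints 1 inside the 06:00-22:00 working-hours
# window and 0 outside it, mirroring the revised timeperiod.
in_working_hours() {
    hh=${1%%:*}
    mm=${1##*:}
    # Strip one leading zero so 08 and 09 are not read as octal.
    minutes=$(( ${hh#0} * 60 + ${mm#0} ))
    if [ "$minutes" -ge 360 ] && [ "$minutes" -lt 1320 ]; then
        echo 1
    else
        echo 0
    fi
}

# Inverted test: resuming is allowed only when the value is 0.
resume_allowed() {
    if [ "$(in_working_hours "$1")" = "0" ]; then
        echo yes
    else
        echo no
    fi
}

resume_allowed 00:46   # prints yes
resume_allowed 21:06   # prints no
resume_allowed 22:30   # prints yes
```

With a single range per weekday, the timeperiod no longer needs entries that wrap around midnight.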
Regards
scottwilkerson

Re: Event Handler stopped working! Part 3

Post by scottwilkerson »

Pitone_Maledetto wrote:In the meantime I have changed the timeperiod to:

define timeperiod{
    timeperiod_name    hr-s3bt-nm
    alias              LService
    sunday             06:00-22:00
    monday             06:00-22:00
    tuesday            06:00-22:00
    wednesday          06:00-22:00
    thursday           06:00-22:00
    friday             06:00-22:00
    saturday           06:00-22:00
    }
And changed the:

if [ $5 = 0 ]
To resume. If 1 don't.
So the inverse with a different timeperiod just to test the theory.
regards
Actually I think this solution looks much more elegant than even what I proposed.
Pitone_Maledetto

Re: Event Handler stopped working! Part 3

Post by Pitone_Maledetto »

Hi @scottwilkerson,
It all seems to be fine now.
It would be interesting to see why the original solution with the long timeperiod did not work as expected.
Anyhow I am happy to close this thread.
Thank you very much for your help.
Regards
scottwilkerson

Re: Event Handler stopped working! Part 3

Post by scottwilkerson »

Great! Locking.

I am actually not sure why the other timeperiod I suggested didn't work.
Locked