issues with Alert Log File

Posted: Thu Sep 15, 2016 3:45 am
by neworderfac33
Good morning,

We've recently noticed that we've not been receiving alert/recovery emails when trying to monitor the status of the IIS service on a number of servers.

Yesterday afternoon, I ran a sync operation on six servers which included stopping of the IIS service, copying files, then restarting IIS at the end of the process.

However, I only received Alert/Recovery emails for two of those servers. I waited until this morning to check that the emails weren't stuck in the Exchange server's buffer, but whilst other, later alert/recovery emails have been sent, there's no sign of the eight missing emails (four alert and four recovery) that I would have expected to see.

I have gone into the history for the service on the six servers in question, and all but one say "No history information was found for this service in the current log file". The two for which the emails WERE sent also say this, although they didn't yesterday afternoon. The one for which there IS a history shows that the service went down at 05:28 this morning and came back up 5 minutes later, but no emails were sent for this occurrence either.

Here's the definition of the service:

Code: Select all

define service{
        use                     generic-service
        hostgroup_name          999
        service_description     Service -  W3SVC/IIS
        check_command           check_nt!SERVICESTATE!-d SHOWALL -l W3SVC
        }
And here's my version of "generic-service":

Code: Select all

define service{
        name                            generic-service         ; The 'name' of this service template
        active_checks_enabled           1                       ; Active service checks are enabled
        passive_checks_enabled          1                       ; Passive service checks are enabled/accepted
        parallelize_check               1                       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1                       ; We should obsess over this service (if necessary)
        check_freshness                 0                       ; Default is to NOT check service 'freshness'
        notifications_enabled           1                       ; Service notifications are enabled
        event_handler_enabled           1                       ; Service event handler is enabled
        flap_detection_enabled          1                       ; Flap detection is enabled
        process_perf_data               1                       ; Process performance data
        retain_status_information       1                       ; Retain status information across program restarts
        retain_nonstatus_information    1                       ; Retain non-status information across program restarts
        is_volatile                     0                       ; The service is not volatile
        check_period                    24x7                    ; The service can be checked at any time of the day
        max_check_attempts              3                       ; Re-check the service up to 3 times to determine its final (hard) state
        normal_check_interval           1                       ; Check the service every minute under normal conditions
        retry_check_interval            5                       ; Re-check the service every 5 minutes until a hard state can be determined
        contact_groups                  admins                  ; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r                 ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           0                       ; Send notifications every xx minutes - 0 for FIRST notification only
        notification_period             24x7                    ; Notifications can be sent out at any time
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
I'm polling every minute - when we do our syncs, they normally take around five minutes, so if I go much beyond this in my normal_check_interval, there's a risk of me missing the IIS service going down and coming back up again completely.

The worrying thing is that it's not ALL my emails that are failing to send, which would at least point to one specific fault. For instance, I'm receiving lots of emails for disk space, CPU usage etc. At the moment, though, whilst the web interface worked fine during the sync, correctly showing the service going down then recovering on each server, the email side of things is intermittent and looking decidedly iffy.

Can anyone think of anything obvious that I might try to rectify this state of affairs - is my logfile getting full and is there anything I can do to clear it down?

As always, thanks in advance for your help.

Pete

Re: issues with Alert Log File

Posted: Thu Sep 15, 2016 11:47 am
by rkennedy
Just to confirm - when you look at the services, did you see the actual state changes appear in XI?

Re: issues with Alert Log File

Posted: Fri Sep 16, 2016 3:05 am
by neworderfac33
Yes I did - and it's Core, not XI.

Re: issues with Alert Log File

Posted: Fri Sep 16, 2016 10:33 am
by rkennedy
Ack, my bad on that part.

Got it - when you look at the Notifications tab in Core, do the notifications show as sent or not sent? Just trying to figure out if it's a mail issue or a configuration issue at this point.

Re: issues with Alert Log File

Posted: Fri Sep 16, 2016 11:45 am
by neworderfac33
An update of sorts - I have developed a PowerShell script to stop then restart the IIS service (W3SVC) with a specified delay in between.

Although the web page always correctly shows when the service has gone down and come back up again, no emails are sent if the delay between stop and start is less than five minutes. BUT, when emails are sent, there is only one minute between the alert and recovery emails going out. Might this be explained by some sort of buffering on the Exchange server?

I'm totally stumped, it's Friday afternoon and I'm going home! Here's my generic-service as it currently stands:

Code: Select all

define service{
        name                            generic-service         ; The 'name' of this service template
        active_checks_enabled           1                       ; Active service checks are enabled
        passive_checks_enabled          1                       ; Passive service checks are enabled/accepted
        parallelize_check               1                       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1                       ; We should obsess over this service (if necessary)
        check_freshness                 0                       ; Default is to NOT check service 'freshness'
        notifications_enabled           1                       ; Service notifications are enabled
        event_handler_enabled           1                       ; Service event handler is enabled
        flap_detection_enabled          1                       ; Flap detection is enabled
        process_perf_data               1                       ; Process performance data
        retain_status_information       1                       ; Retain status information across program restarts
        retain_nonstatus_information    1                       ; Retain non-status information across program restarts
        is_volatile                     0                       ; The service is not volatile
        check_period                    24x7                    ; The service can be checked at any time of the day
        max_check_attempts              5                       ; Re-check the service up to 5 times to determine its final (hard) state
        normal_check_interval           1                       ; Check the service every minute under normal conditions
        retry_check_interval            1                       ; Re-check the service every minute until a hard state can be determined
        contact_groups                  admins                  ; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r                 ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           0                       ; Send notifications every xx minutes - 0 for FIRST notification only
        notification_period             24x7                    ; Notifications can be sent out at any time
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
Have a good weekend all.

pete

Re: issues with Alert Log File

Posted: Fri Sep 16, 2016 1:43 pm
by rkennedy
Could you please look at my previous response?
Got it - when you look at the Notifications tab in Core, do the notifications show as sent or not sent? Just trying to figure out if it's a mail issue or a configuration issue at this point.
This should help to identify if Nagios is actually firing the notifications, or if it's on the Exchange side.

Re: issues with Alert Log File

Posted: Mon Sep 19, 2016 4:07 am
by neworderfac33
Apologies for misreading your post - I repeated my test taking the service down for 3, 4 and 5 minutes and no notification alerts were generated until it was down for 5 minutes.
At this point, I both got the alerts and received the emails.
So, it looks like the problem lies at the Nagios end and not with the Exchange server.
Thanks
Pete

Re: issues with Alert Log File

Posted: Mon Sep 19, 2016 9:58 am
by rkennedy
You might want to adjust these variables (which it seems you already have) -

Code: Select all

        max_check_attempts              3                       ; Re-check the service up to 3 times to determine its final (hard) state
        normal_check_interval           1                       ; Check the service every minute under normal conditions
        retry_check_interval            5                       ; Re-check the service every 5 minutes until a hard state can be determined
Nagios will not put the service into a hard state until the check has failed 3 times in a row (max_check_attempts), and problem notifications are only sent once the state is hard. I would adjust this number down to 1 if you want it to be in a hard state after the first failure.
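As a rough sketch of the arithmetic (not anything Nagios ships), the delay between the first failed check and the hard state can be estimated like this. The function name is made up for illustration, and it assumes the default interval_length of 60 seconds, so the intervals are in minutes:

```python
def minutes_to_hard_state(max_check_attempts, retry_check_interval,
                          interval_length_s=60):
    """Estimate the minutes between the first failed check and the
    failed check that turns the state HARD.

    Assumption: every retry also fails, and interval_length_s is the
    Nagios interval_length setting (60 seconds by default).
    """
    # The first failure is attempt 1, so only the remaining attempts
    # have to wait for a retry interval.
    retries_needed = max_check_attempts - 1
    return retries_needed * retry_check_interval * interval_length_s / 60


# Values quoted above: 3 attempts, retry every 5 minutes
print(minutes_to_hard_state(3, 5))   # 10.0
# Values from the later test: 5 attempts, retry every minute
print(minutes_to_hard_state(5, 1))   # 4.0
# Hard state on the first failure
print(minutes_to_hard_state(1, 5))   # 0.0
```

On top of that you have to add however long it takes for the first failing check to run after the service actually stops, which is up to one normal_check_interval.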

Re: issues with Alert Log File

Posted: Tue Sep 20, 2016 9:54 am
by neworderfac33
When I last ran my tests, the timings were as follows:

Code: Select all

        max_check_attempts              5
        normal_check_interval           1
        retry_check_interval            1
Thanks
Pete

Re: issues with Alert Log File

Posted: Tue Sep 20, 2016 1:27 pm
by rkennedy
peterooney wrote:When I last ran my tests, the timings were as follows:

Code: Select all

        max_check_attempts              5
        normal_check_interval           1
        retry_check_interval            1
Thanks
Pete
no notification alerts were generated until it was down for 5 minutes.
This lines up exactly with what you previously mentioned: with these settings the service will not go into a hard state until 5 consecutive checks, run one minute apart, have failed. If you want the notification sent right away, it may be worth setting max_check_attempts to 1, which I believe would put it into a hard state on the first failure.
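To illustrate, here is a very simplified simulation of the check schedule. It is only a sketch, not how the Nagios scheduler really works: it assumes checks land on exact minute boundaries, that the outage begins half a minute after a check (the outage_start default), and that intervals are in minutes. The function name is made up for this example.

```python
def reaches_hard_state(outage_minutes, max_check_attempts,
                       normal_check_interval=1, retry_check_interval=1,
                       outage_start=0.5):
    """Return True if a simulated outage lasts long enough for
    max_check_attempts consecutive checks to fail (hard state).

    Simplified model: the first check runs at minute 0, and a check
    fails if it lands inside [outage_start, outage_start + outage_minutes).
    """
    outage_end = outage_start + outage_minutes
    t = 0.0                # time of the next check, in minutes
    failed_attempts = 0    # consecutive failed checks so far
    while t < outage_end:
        if t >= outage_start:
            # Check lands inside the outage: soft failure, or hard
            # once the attempt limit is reached.
            failed_attempts += 1
            if failed_attempts >= max_check_attempts:
                return True
            t += retry_check_interval
        else:
            # Service still up: keep the normal schedule.
            t += normal_check_interval
    # The next check runs after the service is back, so it recovers
    # while still in a soft state and no notification goes out.
    return False


for minutes in (3, 4, 5):
    print(minutes, reaches_hard_state(minutes, max_check_attempts=5))
# 3 False
# 4 False
# 5 True

print(reaches_hard_state(3, max_check_attempts=1))   # True
```

In this model the 3 and 4 minute outages recover while the service is still in a soft state, which matches the test results above. The exact threshold on a real system depends on where in the check cycle the service stops.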