We've recently noticed that we've not been receiving alert/recovery emails when trying to monitor the status of the IIS service on a number of servers.
Yesterday afternoon, I ran a sync operation on six servers, which involved stopping the IIS service, copying files, and then restarting IIS at the end of the process.
However, I only received Alert/Recovery emails for two of those servers. I waited until this morning to check that the emails weren't stuck in the Exchange server's buffer, but whilst other, later alert/recovery emails have been sent, there's no sign of the eight missing emails (four alert and four recovery) that I would have expected to see.
I have gone into the history for the service on the six servers in question, and all but one say "No history information was found for this service in the current log file". The two for which the emails WERE sent also say this, although they didn't yesterday afternoon. The one for which there IS a history shows that the service went down at 05:28 this morning and came back up 5 minutes later, but no emails were sent for this occurrence either.
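In case it helps anyone reproduce this, here is a rough sketch of how the alert and notification entries for this service could be pulled straight out of a Nagios log file rather than relying on the history page. It assumes the usual `[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output` and `[epoch] SERVICE NOTIFICATION: contact;host;service;state;command;output` line formats. The host name `web01`, the contact `pete`, and the sample lines themselves are made up for illustration and are not from my real log. If the log has been rotated since the event (Nagios can rotate it on a schedule set by `log_rotation_method`), the same function would need to be pointed at the archived file instead.

```python
import re

# Assumed Nagios log line shape: "[epoch] KIND: field;field;..."
LINE_RE = re.compile(r"^\[(\d+)\] (SERVICE ALERT|SERVICE NOTIFICATION): (.*)$")


def service_events(lines, service_description):
    """Return (epoch, kind, host, state) tuples for one service description."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an alert or notification line
        epoch, kind, rest = match.groups()
        fields = rest.split(";")
        if kind == "SERVICE ALERT":
            # host;service;state;state_type;attempt;output
            host, service, state = fields[0], fields[1], fields[2]
        else:
            # contact;host;service;state;command;output
            host, service, state = fields[1], fields[2], fields[3]
        if service == service_description:
            events.append((int(epoch), kind, host, state))
    return events


# Hypothetical sample lines, NOT taken from a real log.
sample = [
    "[1300000000] SERVICE ALERT: web01;Service - W3SVC/IIS;CRITICAL;SOFT;1;W3SVC: Stopped",
    "[1300000300] SERVICE ALERT: web01;Service - W3SVC/IIS;OK;SOFT;2;W3SVC: Started",
    "[1300000400] SERVICE NOTIFICATION: pete;web01;Disk C;WARNING;notify-by-email;C: 85% used",
]

events = service_events(sample, "Service - W3SVC/IIS")
notifications = [e for e in events if e[1] == "SERVICE NOTIFICATION"]
```

With real data, the lines would come from something like `open(path_to_log)`. An outage that shows `SERVICE ALERT` entries but no matching `SERVICE NOTIFICATION` entries would suggest the notification was never generated, rather than lost on the way to Exchange.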
Here's the definition of the service:
define service{
use generic-service
hostgroup_name 999
service_description Service - W3SVC/IIS
check_command check_nt!SERVICESTATE!-d SHOWALL -l W3SVC
}
And here's the generic-service template that it uses:
define service{
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 3 ; Re-check the service up to 3 times to determine its final (hard) state
normal_check_interval 1 ; Check the service every 1 minute under normal conditions
retry_check_interval 5 ; Re-check the service every 5 minutes until a hard state can be determined
contact_groups admins ; Notifications get sent out to everyone in the 'admins' group
notification_options w,u,c,r ; Send notifications about warning, unknown, critical, and recovery events
notification_interval 0 ; Send notifications every xx minutes - 0 for FIRST notification only
notification_period 24x7 ; Notifications can be sent out at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
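One thing that may be relevant to the settings above: as far as I understand it, Nagios only sends notifications when a service reaches a hard state, which happens after `max_check_attempts` consecutive non-OK results, with the re-checks spaced `retry_check_interval` apart. The sketch below works out how long that takes with the values from my template. It assumes the default `interval_length` of 60 seconds and ignores scheduling latency and the delay before the first failed check, so treat it as a rough lower bound rather than an exact figure. The function names are my own, purely for illustration.

```python
def minutes_until_hard_state(max_check_attempts, retry_check_interval,
                             interval_length_seconds=60):
    """Minutes from the first failed check to the check that makes the state HARD."""
    # The first failure counts as attempt 1, so only the remaining attempts
    # are spaced out by the retry interval.
    retries_needed = max_check_attempts - 1
    return retries_needed * retry_check_interval * interval_length_seconds / 60


def outage_can_reach_hard_state(outage_minutes, max_check_attempts,
                                retry_check_interval):
    """True if the outage lasts at least as long as the retry sequence (simplified)."""
    needed = minutes_until_hard_state(max_check_attempts, retry_check_interval)
    return outage_minutes >= needed


# Values from the generic-service template above.
needed = minutes_until_hard_state(max_check_attempts=3, retry_check_interval=5)

# The 05:28 outage lasted about 5 minutes.
short_outage = outage_can_reach_hard_state(5, 3, 5)
```

Under these assumptions the service would need to stay down for roughly 10 minutes after the first failed check before a notification is due, so a 5-minute outage, or a quick stop and restart during a sync, might never get past the soft state.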
The worrying thing is that it isn't ALL of my emails that are going missing, which would at least point to a specific fault. I'm still receiving plenty of emails for disk space, CPU usage, etc. The web interface also worked fine during the sync, correctly showing the service going down and then recovering on each server. It's only the email side of things that is intermittent and looking decidedly iffy.
Can anyone think of anything obvious that I might try to fix this? Could my log file be getting full and, if so, is there anything I can do to clear it down?
As always, thanks in advance for your help.
Pete