Monitoring Windows Event Logs
Posted: Thu Dec 15, 2011 6:23 pm
So here's what is going on...
I'm using NSClient++ to check event logs on Windows servers. Here is the command I've defined in my commands.cfg file:
define command{
command_name check_log
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -p 5666 -c CheckEventLog -t 30 -a file="system" filter=new filter=out MaxWarn=1 MaxCrit=1 filter-generated=\>1h filter+severity==error filter-severity==success filter-severity==informational filter=in filter=all truncate=1023 unique descriptions "syntax=%severity%: %source%: %message% (%count%)"
}
And here is how I execute the check in my windows.cfg file:
define service{
use generic-service
host_name s-cdc-01.corp.liveops.com
service_description System-Event-Log
check_command check_log
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 3 ; Re-check the service up to 3 times in order to determine its final (hard) state
normal_check_interval 5 ; Check the service every 5 minutes under normal conditions
retry_check_interval 2 ; Re-check the service every two minutes until a hard state can be determined
contact_groups admins ; Notifications get sent out to everyone in the 'admins' group
notification_options w,u,c,r,f,s ; Send notifications about warning, unknown, critical, recovery, flapping, and scheduled downtime events
notification_interval 5 ; Re-notify about service problems every 5 minutes
notification_period 24x7 ; Notifications can be sent out at any time
register 1
}
The problem is that I don't always get notified when an error shows up in the Event Log, and sometimes I get notified about an error that I can't find in the log. For example, here's a message Nagios generated and sent me:
***** Nagios *****
Notification Type: PROBLEM
Service: System-Event-Log
Host: S-CDC-01
Address: 192.168.152.14
State: CRITICAL
Date/Time: Wed Dec 14 11:21:12 PST 2011
Additional Info:
error: DCOM: (1), eventlog: 1 critical
If I look at the event log, though, I don't see any DCOM errors. I also see other errors I should have been notified of, but wasn't.
Right now I'm only testing this on one server and one log, the System log. Yesterday I ran it against the Application log and got lots of alerts, again for events I couldn't find in the log. If I'm reading my query right, it should return only errors generated within the last hour, yet I was being alerted on errors I couldn't see at all.
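To sanity-check my reading of the time window, here is a rough sketch in Python. It is not anything from NSClient++ or Nagios; the function name and parameters are my own, made up for illustration. It assumes the filter really does select records generated within the last hour and that checks run on a fixed 5-minute schedule. Under those assumptions it counts how many consecutive checks would match one and the same event:

```python
def checks_seeing_event(event_minute, check_interval, lookback, horizon=24 * 60):
    """Count scheduled checks whose lookback window contains the event.

    All values are in minutes. Checks are assumed to run at 0, check_interval,
    2 * check_interval, ... up to (but not including) horizon.
    """
    count = 0
    for check_minute in range(0, horizon, check_interval):
        age = check_minute - event_minute  # how old the event is at this check
        if 0 <= age < lookback:            # event already exists and is inside the window
            count += 1
    return count


if __name__ == "__main__":
    # One error logged at minute 3, checked every 5 minutes with a 60-minute lookback.
    print(checks_seeing_event(3, 5, 60))  # prints 12
```

If that model is right, a single error would keep matching for about an hour after it was logged, so an alert could easily refer to an event that is no longer near the top of the log when I go looking for it.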
I just need some help locking this down to get a competent query that returns useful information.