Two questions, check command and notifications

Post by **greyclear** » Thu Sep 25, 2014 9:07 am

I'm testing this plugin and it works from the command line: http://exchange.nagios.org/directory/Pl ... us/details

./check_snmp_synology -h 10.2.2.4 -v
Synology model:    "DS213j"
Synology s/n:      "1440LAN010857"
DSM Version:       "DSM 5.0-4493"
System Status:     Normal
Power Status:      Normal
System Fan Status: Normal
CPU Fan Status:    Normal
Number of disks:   2
 "Disk 1" (model: "ST4000DM000-1F2168      ") status:Normal temperature:31 C
 "Disk 2" (model: "ST4000DM000-1F2168      ") status:Normal temperature:34 C
Number of RAID volume: 1
 "Volume 1" status:Normal
OK - Synology  "DS213j" (s/n: "1440LAN010857",  "DSM 5.0-4493") is in good health

Sure there is something wrong with my checks

commands

Code:

define command{
command_name check_snmp_synology
command_line $USER1$/check_snmp_synology -H $HOSTADDRESS$ -v $ARG1$ $ARG2$
}

Code:

# Service Defs
define service{
	use generic-service
	host_name 10.2.2.4
	service_description Hardware
	check_command check_snmp_synology
}

And second question. I am not getting a notification when one of our networks can't ping. Otherwise it works just fine

Code:

define host{
	use		generic-switch			; Inherit default values from a template
	host_name	xxxx 10.10.6.1		; The name we're giving to this switch
	alias		xxxx Gateway			; A longer name associated with the switch
	address		10.10.6.1				; IP address of the switch
	hostgroups	Stl-Sonicwall				; Host groups this switch is associated with
	}

Code:

define service{
	use						generic-service
	host_name					xxxx 10.10.6.1
	service_description				PING
	check_command					check_ping!200.0,20%!600.0,60%
	normal_check_interval	1
	retry_check_interval	1
	}

Code:

define host{
	name			generic-switch	; The name of this host template
	use				generic-host	; Inherit default values from the generic-host template
	check_period		24x7		; By default, switches are monitored round the clock
	check_interval		5		; Switches are checked every 5 minutes
	retry_interval		1		; Schedule host check retries at 1 minute intervals
	max_check_attempts	10		; Check each switch 10 times (max)
	check_command		check-host-alive	; Default command to check if routers are "alive"
	notification_period	24x7		; Send notifications at any time
	notification_interval	30		; Resend notifications every 30 minutes
	notification_options	d,r		; Only send notifications for specific host states
	contact_groups		admins		; Notifications get sent to the admins by default
	register		0		; DONT REGISTER THIS - ITS JUST A TEMPLATE
	}

Post by **Box293** » Thu Sep 25, 2014 3:15 pm

greyclear wrote:Sure there is something wrong with my checks

On your command line:

greyclear wrote:./check_snmp_synology -h 10.2.2.4 -v

In your command definition:

greyclear wrote:command_line $USER1$/check_snmp_synology -H $HOSTADDRESS$ -v $ARG1$ $ARG2$

I think you need a lower case h in your command definition.
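To make the mismatch easier to see, here is a minimal Python sketch of how the macros in a command_line get filled in. It is only an illustration, not how Nagios itself is implemented. The `$USER1$` path is an assumption (the stock plugin directory), and the host address is assumed to be 10.2.2.4 because the host definition was not posted.

```python
import shlex


def expand_check_command(check_command, command_line, macros):
    """Fill in $MACRO$ placeholders and return the resulting argument list."""
    _name, *args = check_command.split("!")
    values = dict(macros)
    # $ARG1$..$ARG9$ come from the "!"-separated parts of check_command;
    # arguments that were not supplied expand to an empty string.
    for i in range(1, 10):
        values[f"ARG{i}"] = args[i - 1] if i <= len(args) else ""
    for name, value in values.items():
        command_line = command_line.replace(f"${name}$", value)
    return shlex.split(command_line)


# Assumed values: USER1 is the usual default plugin path, and the host
# address is taken from the host_name in the service definition.
macros = {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "10.2.2.4"}

from_nagios = expand_check_command(
    "check_snmp_synology",
    "$USER1$/check_snmp_synology -H $HOSTADDRESS$ -v $ARG1$ $ARG2$",
    macros,
)
from_shell = shlex.split("./check_snmp_synology -h 10.2.2.4 -v")

print(from_nagios[1:])  # options as Nagios would pass them
print(from_shell[1:])   # options used in the working manual run
```

The two option lists differ only in the case of the first flag, which is the part to change in the command definition.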

greyclear wrote:And second question. I am not getting a notification when one of our networks can't ping. Otherwise it works just fine

Do you mean when your host object goes down or the service object for the host? Which one are you expecting a notification from?

Post by **greyclear** » Fri Sep 26, 2014 7:55 am

Capital H, lower case h... I feel like an idiot lol. The switch only has one function, so when it is unreachable I'd like it to notify.

Post by **Box293** » Fri Sep 26, 2014 10:13 am

Another pair of eyes can solve problems quickly

greyclear wrote:I am not getting a notification when one of our networks can't ping

Here are the check intervals for the host object being inherited from the template:

Code:

check_interval      5      ; Switches are checked every 5 minutes
retry_interval      1      ; Schedule host check retries at 1 minute intervals
max_check_attempts   10      ; Check each switch 10 times (max)

So the host needs to enter a HARD state before it sends notifications. Here is how the host enters a hard state:

10:00am = Nagios ping check, ping received OK, next scheduled check is 10:05am
10:03am = switch goes down, Nagios does not know about it yet
10:05am = Nagios ping check, ping failed, current check 1/10, host enters SOFT state, next scheduled check is 10:06am
10:06am = Nagios ping check, ping failed, current check 2/10, host still in SOFT state, next scheduled check is 10:07am
10:07am = Nagios ping check, ping failed, current check 3/10, host still in SOFT state, next scheduled check is 10:08am
10:08am = Nagios ping check, ping failed, current check 4/10, host still in SOFT state, next scheduled check is 10:09am
10:09am = Nagios ping check, ping failed, current check 5/10, host still in SOFT state, next scheduled check is 10:10am
10:10am = Nagios ping check, ping failed, current check 6/10, host still in SOFT state, next scheduled check is 10:11am
10:11am = Nagios ping check, ping failed, current check 7/10, host still in SOFT state, next scheduled check is 10:12am
10:12am = Nagios ping check, ping failed, current check 8/10, host still in SOFT state, next scheduled check is 10:13am
10:13am = Nagios ping check, ping failed, current check 9/10, host still in SOFT state, next scheduled check is 10:14am
10:14am = Nagios ping check, ping failed, current check 10/10, host enters HARD state, NOTIFICATION is sent, next scheduled check is 10:15am

So while the switch went down at 10:03am, Nagios detected it at 10:05am, BUT it wasn't until 10:14am, when Nagios reached the maximum number of check attempts, that the host entered a HARD state and the notification was sent.
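The same timeline can be reproduced with a short Python sketch. This is a simplified model that only uses check_interval, retry_interval and max_check_attempts from the template. It ignores check latency and anything else that might trigger an extra host check, and the date used is arbitrary.

```python
from datetime import datetime, timedelta


def host_check_timeline(last_ok, went_down, check_interval=5,
                        retry_interval=1, max_check_attempts=10):
    """Return (check_time, attempt, state_type) for each failed host check."""
    # Regular checks run every check_interval minutes; the first one at or
    # after the outage is the first failed check.
    check_time = last_ok
    while check_time < went_down:
        check_time += timedelta(minutes=check_interval)

    timeline = []
    for attempt in range(1, max_check_attempts + 1):
        # Only the final attempt turns the problem into a HARD state,
        # which is when the notification goes out.
        state_type = "HARD" if attempt == max_check_attempts else "SOFT"
        timeline.append((check_time, attempt, state_type))
        # Retries are scheduled retry_interval minutes apart.
        check_time += timedelta(minutes=retry_interval)
    return timeline


day = datetime(2014, 9, 26)  # arbitrary date, only the times matter
timeline = host_check_timeline(
    last_ok=day.replace(hour=10, minute=0),
    went_down=day.replace(hour=10, minute=3),
)
for check_time, attempt, state_type in timeline:
    print(f"{check_time:%H:%M} check {attempt}/{len(timeline)} {state_type}")
```

Lowering max_check_attempts in the call shows how much sooner the HARD state, and so the notification, would arrive.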

Does this answer the question as to why you are not receiving notifications?

Post by **greyclear** » Mon Sep 29, 2014 8:05 am

Hmm, the max check attempts was set to 3 when I looked at it, but I'm not sure why it didn't log a hard state. I have a secondary system in place that monitors our switch, PingPlotter, which we are doing away with at some point. Other hosts and services are notifying just fine. Is there possibly something else I could look at?

Post by **Box293** » Mon Sep 29, 2014 1:41 pm

The example I showed you with the max_check_attempts set to 10 was for the HOST object.

Perhaps you were looking at the SERVICE object. I notice that your service object definition does not have max_check_attempts defined. It does use the template "generic-service"; however, you have not posted that template definition, so I can't confirm that.

Post by **greyclear** » Tue Sep 30, 2014 7:54 am

Ohhhhh, ok.

Code:

# Generic service definition template - This is NOT a real service, just a template!

define service{
        name                            generic-service 	; The 'name' of this service template
        active_checks_enabled           1       		; Active service checks are enabled
        passive_checks_enabled          1    		   	; Passive service checks are enabled/accepted
        parallelize_check               1       		; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       		; We should obsess over this service (if necessary)
        check_freshness                 0       		; Default is to NOT check service 'freshness'
        notifications_enabled           1       		; Service notifications are enabled
        event_handler_enabled           1       		; Service event handler is enabled
        flap_detection_enabled          1       		; Flap detection is enabled
        process_perf_data               1       		; Process performance data
        retain_status_information       1       		; Retain status information across program restarts
        retain_nonstatus_information    1       		; Retain non-status information across program restarts
        is_volatile                     0       		; The service is not volatile
        check_period                    24x7			; The service can be checked at any time of the day
        max_check_attempts              3			; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           1			; Check the service every minute under normal conditions
        retry_check_interval            2			; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  admins			; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r			; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60			; Re-notify about service problems every hour
        notification_period             workhours			; Notifications are only sent during the 'workhours' time period
        register                        0       		; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }


# Local service definition template - This is NOT a real service, just a template!

define service{
        name                            local-service 		; The name of this service template
        use                             generic-service		; Inherit default values from the generic-service definition
        max_check_attempts              4			; Re-check the service up to 4 times in order to determine its final (hard) state
        normal_check_interval           5			; Check the service every 5 minutes under normal conditions
        retry_check_interval            1			; Re-check the service every minute until a hard state can be determined
        register                        0       		; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
	}
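For reference, a rough Python sketch of how the `use` chain resolves for the PING service posted earlier. It assumes simple single-template inheritance where locally defined directives win over inherited ones, and it copies only a few of the directives from the definitions in this thread.

```python
# Directives copied from the generic-service template above (subset only).
templates = {
    "generic-service": {
        "max_check_attempts": "3",
        "normal_check_interval": "1",
        "retry_check_interval": "2",
        "notification_options": "w,u,c,r",
        "notification_interval": "60",
        "notification_period": "workhours",
        "contact_groups": "admins",
    },
}

# The PING service definition from the first post.
ping_service = {
    "use": "generic-service",
    "host_name": "xxxx 10.10.6.1",
    "service_description": "PING",
    "check_command": "check_ping!200.0,20%!600.0,60%",
    "normal_check_interval": "1",
    "retry_check_interval": "1",
}


def resolve(obj, templates):
    """Merge directives along the `use` chain; local values override inherited ones."""
    merged = {}
    parent = obj.get("use")
    if parent is not None:
        merged.update(resolve(templates[parent], templates))
    merged.update({key: value for key, value in obj.items() if key != "use"})
    return merged


effective = resolve(ping_service, templates)
for key in ("max_check_attempts", "retry_check_interval", "notification_period"):
    print(f"{key} = {effective[key]}")
```

Under these assumptions the PING service ends up with max_check_attempts 3, a retry_check_interval of 1 (the local value replaces the template's 2), and a notification_period of workhours. The workhours time period definition has not been posted here, so what hours it covers is unknown.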

Also, since I posted, my Synology check has stopped working :/

Code:


./check_snmp_synology -h 10.2.2.4 -v
Synology model:    "DS213j"
Synology s/n:      "1440LAN010857"
DSM Version:       "DSM 5.0-4493"
System Status:     Normal
Power Status:      Normal
System Fan Status: Normal
CPU Fan Status:    Normal
Number of disks:   2
 "Disk 1" (model: "ST4000DM000-1F2168      ") status:Normal temperature:33 C
 "Disk 2" (model: "ST4000DM000-1F2168      ") status:Normal temperature:37 C
Number of RAID volume: 1
 No Such Instance currently exists at this OID status:No Such Instance currently exists at this OID
CRITICAL - Synology  "DS213j" (s/n: "1440LAN010857",  "DSM 5.0-4493"), RAID status ( No Such Instance currently exists at this OID ): No Such Instance currently exists at this OID

Post by **tmcdonald** » Tue Sep 30, 2014 10:37 am

Let's do an snmpwalk against the device. For a v2 device, run the following:

Code:

snmpwalk -v 2c -c <community> 10.2.2.4

Replace <community> with the respective value, of course. I believe the plugin requires it to be "public". Be sure to censor any sensitive information in the output.

Also, did you power cycle the device recently? Sometimes that can shuffle around the OIDs.
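If the output is long, a small filter can pick out the lines where the agent reports a missing value. The sketch below is only an illustration in Python. It looks for the two "No Such ..." messages that net-snmp prints, and uses the last lines of the plugin output posted above as sample input.

```python
# Messages net-snmp prints when an OID has no value on the agent.
SNMP_MISSING_MARKERS = (
    "No Such Instance currently exists at this OID",
    "No Such Object available on this agent at this OID",
)


def lines_with_missing_oids(output):
    """Return the lines of SNMP tool output that contain a 'No Such ...' marker."""
    return [
        line.strip()
        for line in output.splitlines()
        if any(marker in line for marker in SNMP_MISSING_MARKERS)
    ]


# Sample input: the tail of the failing plugin run posted earlier in the thread.
plugin_output = '''Number of RAID volume: 1
 No Such Instance currently exists at this OID status:No Such Instance currently exists at this OID
CRITICAL - Synology  "DS213j" (s/n: "1440LAN010857",  "DSM 5.0-4493"), RAID status ( No Such Instance currently exists at this OID ): No Such Instance currently exists at this OID
'''

for line in lines_with_missing_oids(plugin_output):
    print(line)
```

The same function can be pointed at saved snmpwalk output to see which OIDs the device no longer answers for.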

Post by **greyclear** » Tue Sep 30, 2014 10:58 am

Nope

DiskStationSTL> uptime
08:56:36 up 10 days, 18:27, load average: 3.27, 2.75, 2.38
DiskStationSTL>

snmpwalk results attached

Post by **tmcdonald** » Tue Sep 30, 2014 4:38 pm

Let's try this with bash debugging. Open up the plugin and change the first line from:

Code:

#!/bin/bash

to

Code:

#!/bin/bash -x

and run it again, posting the output here in code tags. This should tell us where it is failing.

Nagios Support Forum
