Nagios Support Forum

Posted: **Tue Mar 19, 2013 10:36 am**

I have a handful of Linux servers that I'm using this check (v2c) on.

I am checking two processes on these servers.

Of all of the service checks I have running across our system, these are the only two that seem to have a lag.

They will constantly show up as being down, but for no more than about 30 seconds. When I check, they are running on the server in question just fine.

I'm wondering if this is a known issue and if so, can I run a different, better optimized check instead?

Posted: **Tue Mar 19, 2013 10:38 am**

Are the servers in question under heavy load? You may have to increase the timeout on the check to accommodate a server with strict preemption under heavy load.

Posted: **Tue Mar 19, 2013 10:48 am**

abrist wrote: Are the servers in question under heavy load? You may have to increase the timeout on the check to accommodate a server with strict preemption under heavy load.

Not really. Here's an example on one of the boxes that just alerted then went away:

Code:

# uptime
 10:47:34 up 176 days, 22:05,  1 user,  load average: 0.73, 0.88, 0.86

And another:

Code:

# uptime
 10:52:22 up 31 days,  9:42,  1 user,  load average: 0.68, 0.83, 0.85

Posted: **Tue Mar 19, 2013 10:55 am**

What services are you checking?

Posted: **Tue Mar 19, 2013 10:58 am**

It's a process that is specific to our application. It's not a standard Linux process. Basically, an image capture and transfer process.

Posted: **Tue Mar 19, 2013 12:42 pm**

Try passing a longer timeout than the default (30 seconds or so):

Code:

 -t, --timeout=INTEGER
    Seconds before connection times out (default: 10)
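Before changing the timeout, it can help to see how long the check actually takes when run by hand. The following is only a rough Python sketch, not part of the plugin: `time_check` is a hypothetical helper name, and the `demo` command is a stand-in so the snippet runs anywhere. Swap in your own `check_snmp_process_wizard.pl` call with your host and options.

```python
import subprocess
import sys
import time


def time_check(argv, timeout=70):
    """Run a plugin command once; return (exit_code, seconds, first_output_line)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        code, output = proc.returncode, proc.stdout.strip()
    except subprocess.TimeoutExpired:
        # 3 is the conventional plugin exit code for UNKNOWN.
        code, output = 3, "wrapper timeout"
    elapsed = time.monotonic() - start
    first_line = output.splitlines()[0] if output else ""
    return code, elapsed, first_line


# Stand-in command (assumption) so the sketch is runnable without the plugin.
demo = [sys.executable, "-c", "print('PROCS OK')"]
code, elapsed, line = time_check(demo)
print(code, round(elapsed, 2), line)
```

If the runs that fail take about as long as the configured timeout, the check is timing out rather than seeing the process as missing.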

Posted: **Tue Mar 19, 2013 2:15 pm**

I have upped this to 60 seconds but I'm still getting the alerts.

Posted: **Tue Mar 19, 2013 2:48 pm**

Can we see the config file for one of the checks? Go to the CCM and click the "disk" image next to one of these service checks. Post the file in code wraps.

Posted: **Tue Mar 19, 2013 2:57 pm**

Hoping I've copied everything you would need.

Code:

define service {
	host_name			{removed for brevity}
	service_description		Video Capture
	use				xiwizard_linuxsnmp_process
	hostgroup_name			All Controllers - Ramps,All Controllers
	display_name			Video Capture
	servicegroups			Techs
	check_command			check_xi_service_snmp_linux_process!-C roadway --v2c -n 'ves_cap_trx' -w0,2 -t 60!!!!!!!
	register			1
	}

Code:

define service {
       name                          		xiwizard_linuxsnmp_process
       service_description           		xiwizard_linuxsnmp_process
       display_name                  		Linux SNMP Process Check
       use                           		xiwizard_generic_service
       check_command                 		check_xi_service_snmp_linux_process!!!!!!!!
       register                    		0

}

Code:

define service {
       name                          		xiwizard_generic_service
       service_description           		xiwizard_generic_service
       display_name                  		Generic Service Check
       check_command                 		check_xi_service_none
       is_volatile                   		0
       max_check_attempts            		5
       check_interval                		5
       retry_interval                		1
       active_checks_enabled         		1
       passive_checks_enabled        		1
       check_period                  		xi_timeperiod_24x7
       parallelize_check             		1
       obsess_over_service           		1
       check_freshness               		0
       freshness_threshold           		1800
       event_handler                 		host-notify-by-email
       event_handler_enabled         		1
       flap_detection_enabled        		1
       process_perf_data             		1
       retain_status_information     		1
       retain_nonstatus_information  		1
       notification_interval         		60
       first_notification_delay      		60
       notification_period           		xi_timeperiod_24x7
       notifications_enabled         		1
       contacts                      		nagiosadmin
       contact_groups                		admins, Techs
       failure_prediction_enabled    		1
       register                    		0

}

Code:

define command {
       command_name                  		check_xi_service_snmp_linux_process
       command_line                  		$USER1$/check_snmp_process_wizard.pl -H $HOSTADDRESS$ $ARG1$
}
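For anyone following along, this is roughly how the `check_command` in the service maps onto the `command_line` above: everything after the first `!` becomes `$ARG1$`, the next field `$ARG2$`, and so on. The Python below is a simplified sketch of that substitution, not Nagios code. The function name, the `$USER1$` path, and the host address are assumptions for illustration, and escaped `\!` characters are ignored.

```python
def expand_check_command(check_command, command_line, macros):
    """Split 'name!arg1!arg2...' and substitute $ARGn$ and other macros."""
    name, *args = check_command.split("!")
    values = dict(macros)
    for i, arg in enumerate(args, start=1):
        values[f"$ARG{i}$"] = arg
    expanded = command_line
    for macro, value in values.items():
        expanded = expanded.replace(macro, value)
    return name, expanded


check_command = (
    "check_xi_service_snmp_linux_process!"
    "-C roadway --v2c -n 'ves_cap_trx' -w0,2 -t 60!!!!!!!"
)
command_line = "$USER1$/check_snmp_process_wizard.pl -H $HOSTADDRESS$ $ARG1$"

# Both values below are placeholders, not taken from the posted config.
macros = {
    "$USER1$": "/usr/local/nagios/libexec",
    "$HOSTADDRESS$": "192.0.2.10",
}

name, expanded = expand_check_command(check_command, command_line, macros)
print(name)
print(expanded)
```

The trailing `!` characters just produce empty `$ARG2$` through `$ARG8$` values, which this command definition never uses.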

Posted: **Tue Mar 19, 2013 3:28 pm**

jbennett wrote: I'm wondering if this is a known issue and if so, can I run a different, better optimized check instead?

Everything looks correct. However, this is an SNMP check, so it runs over UDP, and because UDP is stateless, packets can be dropped without notice. That is likely what is happening.
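To illustrate the UDP point, here is a small Python sketch (not how the plugin itself is written). A client sends a datagram and waits; if nothing comes back it cannot tell whether the request was lost, the reply was lost, or the other end is just slow, so all it can do is time out and resend. `query_with_retries` is a hypothetical name, and the silent local socket stands in for a dropped packet.

```python
import socket


def query_with_retries(address, payload, timeout=0.2, retries=2):
    """Send a UDP datagram and wait for a reply, resending on timeout.

    Returns the reply bytes, or None if every attempt timed out.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(retries + 1):
            sock.sendto(payload, address)
            try:
                reply, _sender = sock.recvfrom(4096)
                return reply
            except socket.timeout:
                # No reply: request lost, reply lost, or a slow agent.
                # UDP gives no way to tell which, so try again.
                continue
    return None


# A bound socket that never answers simulates a lost datagram.
silent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
silent.bind(("127.0.0.1", 0))
result = query_with_retries(silent.getsockname(), b"ping")
silent.close()
print(result)
```

When every attempt is lost, the caller only sees "no answer", which a check can then report as the process being down even though it is running.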

With the config you posted, though, it shouldn't send notifications if the service is only down for 30 seconds. It should try 5 times at 1-minute intervals before sending a notification.
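As a rough illustration of that retry behavior, the Python sketch below counts consecutive failed checks and reports when the count reaches `max_check_attempts` (5 in the posted template), which is the point where the problem becomes a hard state. It is a simplified model, not Nagios logic verbatim, and `first_hard_problem` is a hypothetical name.

```python
def first_hard_problem(results, max_check_attempts=5):
    """Return the index of the check that reaches a HARD problem state, or None.

    `results` is a sequence of booleans; True means the check returned OK.
    """
    attempt = 0
    for index, ok in enumerate(results):
        if ok:
            attempt = 0  # a recovery resets the attempt counter
            continue
        attempt += 1
        if attempt >= max_check_attempts:
            return index
    return None


# One failed check followed by a good retry: a short blip.
blip = [True, False, True, True]
# Five failed checks in a row: a real outage.
outage = [True, False, False, False, False, False]

print(first_hard_problem(blip))    # None
print(first_hard_problem(outage))  # 5
```

With `retry_interval` set to 1, the four retries after the first failure take about 4 minutes, so a 30-second blip should clear on the first retry and stay in a soft state.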


check_snmp_process_wizard.pl lag?
