check_snmp_process_wizard.pl lag?

jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Post by jbennett »

I have a handful of Linux servers that I'm using this check (v2c) on.

I am checking two processes on these servers.

Of all of the service checks I have running across our system, these are the only two that seem to have a lag.

They will constantly show up as being down, but for no more than about 30 seconds. When I check, they are running on the server in question just fine.

I'm wondering if this is a known issue and if so, can I run a different, better optimized check instead?
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: check_snmp_process_wizard.pl lag?

Post by abrist »

Are the servers in question under heavy load? You may have to increase the timeout on the check to accommodate a server with strict preemption under heavy load.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
jbennett

Re: check_snmp_process_wizard.pl lag?

Post by jbennett »

abrist wrote: Are the servers in question under heavy load? You may have to increase the timeout on the check to accommodate a server with strict preemption under heavy load.
Not really. Here's an example on one of the boxes that just alerted then went away:

# uptime
 10:47:34 up 176 days, 22:05,  1 user,  load average: 0.73, 0.88, 0.86
And another:

# uptime
 10:52:22 up 31 days,  9:42,  1 user,  load average: 0.68, 0.83, 0.85
abrist

Re: check_snmp_process_wizard.pl lag?

Post by abrist »

What services are you checking?
jbennett

Re: check_snmp_process_wizard.pl lag?

Post by jbennett »

It's a process that is specific to our application. It's not a standard Linux process. Basically, an image capture and transfer process.
abrist

Re: check_snmp_process_wizard.pl lag?

Post by abrist »

Try passing a longer timeout than default (try 30 seconds or so):

 -t, --timeout=INTEGER
    Seconds before connection times out (default: 10)
jbennett

Re: check_snmp_process_wizard.pl lag?

Post by jbennett »

I have upped this to 60 seconds but I'm still getting the alerts.
abrist

Re: check_snmp_process_wizard.pl lag?

Post by abrist »

Can we see the config file for one of the checks? Go to the CCM and click the "disk" image next to one of these service checks. Post the file in code wraps.
jbennett

Re: check_snmp_process_wizard.pl lag?

Post by jbennett »

Hoping I've copied everything you would need.

define service {
	host_name			{removed for brevity}
	service_description		Video Capture
	use				xiwizard_linuxsnmp_process
	hostgroup_name			All Controllers - Ramps,All Controllers
	display_name			Video Capture
	servicegroups			Techs
	check_command			check_xi_service_snmp_linux_process!-C roadway --v2c -n 'ves_cap_trx' -w0,2 -t 60!!!!!!!
	register			1
	}	

define service {
       name                          		xiwizard_linuxsnmp_process
       service_description           		xiwizard_linuxsnmp_process
       display_name                  		Linux SNMP Process Check
       use                           		xiwizard_generic_service
       check_command                 		check_xi_service_snmp_linux_process!!!!!!!!
       register                    		0

}

define service {
       name                          		xiwizard_generic_service
       service_description           		xiwizard_generic_service
       display_name                  		Generic Service Check
       check_command                 		check_xi_service_none
       is_volatile                   		0
       max_check_attempts            		5
       check_interval                		5
       retry_interval                		1
       active_checks_enabled         		1
       passive_checks_enabled        		1
       check_period                  		xi_timeperiod_24x7
       parallelize_check             		1
       obsess_over_service           		1
       check_freshness               		0
       freshness_threshold           		1800
       event_handler                 		host-notify-by-email
       event_handler_enabled         		1
       flap_detection_enabled        		1
       process_perf_data             		1
       retain_status_information     		1
       retain_nonstatus_information  		1
       notification_interval         		60
       first_notification_delay      		60
       notification_period           		xi_timeperiod_24x7
       notifications_enabled         		1
       contacts                      		nagiosadmin
       contact_groups                		admins, Techs
       failure_prediction_enabled    		1
       register                    		0

}

define command {
       command_name                  		check_xi_service_snmp_linux_process
       command_line                  		$USER1$/check_snmp_process_wizard.pl -H $HOSTADDRESS$ $ARG1$
}
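
To confirm that the `-t 60` timeout actually reaches the plugin, it can help to expand the command by hand. The sketch below is a simplified stand-in for Nagios macro expansion, not the real implementation: the `expand_command` helper, the `/usr/local/nagios/libexec` value for `$USER1$`, and the `192.0.2.10` address are all assumptions for illustration.

```python
def expand_command(command_line: str, check_command: str, macros: dict) -> str:
    """Simplified macro expansion: substitute $ARGn$ and named macros."""
    # check_command has the form "command_name!arg1!arg2!..."
    _name, *args = check_command.split("!")
    expanded = command_line
    for index, arg in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", arg)
    for key, value in macros.items():
        expanded = expanded.replace(f"${key}$", value)
    return expanded


COMMAND_LINE = "$USER1$/check_snmp_process_wizard.pl -H $HOSTADDRESS$ $ARG1$"
CHECK_COMMAND = (
    "check_xi_service_snmp_linux_process!"
    "-C roadway --v2c -n 'ves_cap_trx' -w0,2 -t 60!!!!!!!"
)
# Assumed values; substitute the real plugin directory and host address.
MACROS = {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.0.2.10"}

print(expand_command(COMMAND_LINE, CHECK_COMMAND, MACROS))
```

Running the printed command manually as the nagios user, several times in a row, is one way to see whether the plugin itself intermittently reports the process as missing.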
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: check_snmp_process_wizard.pl lag?

Post by scottwilkerson »

jbennett wrote: I'm wondering if this is a known issue and if so, can I run a different, better optimized check instead?
Everything looks correct. However, this is an SNMP check, so it runs over UDP, which is connectionless: packets can be dropped without either side noticing. That is likely what is happening.
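
One way to test the dropped-packet theory is to poll the process table directly from the Nagios server and count matches. This is only a sketch: it assumes net-snmp's `snmpwalk` is installed and that the plugin reads the standard HOST-RESOURCES-MIB process table (`hrSWRunName`), which has not been confirmed in this thread. The helper names are hypothetical.

```python
import re
import subprocess

# Assumption: the plugin matches process names against this table.
HR_SW_RUN_NAME = "HOST-RESOURCES-MIB::hrSWRunName"


def build_snmpwalk(host, community, retries=3, timeout=5):
    """Build an snmpwalk command with explicit retries (-r) and timeout (-t)."""
    return [
        "snmpwalk", "-v2c", "-c", community,
        "-r", str(retries), "-t", str(timeout),
        host, HR_SW_RUN_NAME,
    ]


def count_process(walk_output, name):
    """Count rows in snmpwalk output whose STRING value equals `name`."""
    pattern = re.compile(r'=\s*STRING:\s*"?([^"\n]*)"?\s*$')
    count = 0
    for line in walk_output.splitlines():
        match = pattern.search(line)
        if match and match.group(1).strip() == name:
            count += 1
    return count


def poll(host, community, name):
    """Run one walk against the host and return the number of matches."""
    result = subprocess.run(
        build_snmpwalk(host, community),
        capture_output=True, text=True, timeout=60,
    )
    return count_process(result.stdout, name)
```

Calling `poll(host, "roadway", "ves_cap_trx")` in a loop and logging the count would show whether the table is occasionally returned empty or incomplete while the process is still running.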

With the config you posted, though, it shouldn't send notifications if the service is only down for 30 seconds. It should try 5 times at 1-minute intervals before sending a notification.
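
As a rough check of that timing, the arithmetic can be sketched as follows. It assumes the default `interval_length` of 60 seconds, so that `retry_interval 1` means one minute; the function name is made up for illustration.

```python
def minutes_to_hard_state(max_check_attempts, retry_interval, interval_length=60):
    """Minutes from the first failed check until the service goes hard.

    The first failure counts as attempt 1, so only
    (max_check_attempts - 1) retries remain before the hard state.
    """
    retries = max_check_attempts - 1
    return retries * retry_interval * interval_length / 60


# Values from the xiwizard_generic_service template posted above.
print(minutes_to_hard_state(max_check_attempts=5, retry_interval=1))
```

With the posted values this comes to about four minutes of consecutive failures, so a 30-second blip should clear on the first retry and stay a soft state. The template's `first_notification_delay 60` should postpone the first notification further still.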
Former Nagios employee