check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 10:36 am
by jbennett
I have a handful of Linux servers that I'm using this check (v2c) on.
I am checking two processes on these servers.
Of all of the service checks I have running across our system, these are the only two that seem to have a lag.
They will constantly show up as being down, but for no more than about 30 seconds. When I check, they are running on the server in question just fine.
I'm wondering if this is a known issue and if so, can I run a different, better optimized check instead?
Re: check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 10:38 am
by abrist
Are the servers in question under heavy load? You may have to increase the timeout on the check to accommodate a server with strict preemption under heavy load.
Re: check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 10:48 am
by jbennett
abrist wrote: Are the servers in question under heavy load? You may have to increase the timeout on the check to accommodate a server with strict preemption under heavy load.
Not really. Here's an example on one of the boxes that just alerted then went away:
Code:
# uptime
10:47:34 up 176 days, 22:05, 1 user, load average: 0.73, 0.88, 0.86
And another:
Code:
# uptime
10:52:22 up 31 days, 9:42, 1 user, load average: 0.68, 0.83, 0.85
Re: check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 10:55 am
by abrist
What services are you checking?
Re: check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 10:58 am
by jbennett
It's a process that is specific to our application. It's not a standard Linux process. Basically, an image capture and transfer process.
Re: check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 12:42 pm
by abrist
Try passing a longer timeout than default (try 30 seconds or so):
Code:
-t, --timeout=INTEGER
Seconds before connection times out (default: 10)
Re: check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 2:15 pm
by jbennett
I have upped this to 60 seconds but I'm still getting the alerts.
Re: check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 2:48 pm
by abrist
Can we see the config file for one of the checks? Go to the CCM and click the "disk" image next to one of these service checks. Post the file in code wraps.
Re: check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 2:57 pm
by jbennett
Hoping I've copied everything you would need.
Code:
define service {
host_name {removed for brevity}
service_description Video Capture
use xiwizard_linuxsnmp_process
hostgroup_name All Controllers - Ramps,All Controllers
display_name Video Capture
servicegroups Techs
check_command check_xi_service_snmp_linux_process!-C roadway --v2c -n 'ves_cap_trx' -w0,2 -t 60!!!!!!!
register 1
}
Code:
define service {
name xiwizard_linuxsnmp_process
service_description xiwizard_linuxsnmp_process
display_name Linux SNMP Process Check
use xiwizard_generic_service
check_command check_xi_service_snmp_linux_process!!!!!!!!
register 0
}
Code:
define service {
name xiwizard_generic_service
service_description xiwizard_generic_service
display_name Generic Service Check
check_command check_xi_service_none
is_volatile 0
max_check_attempts 5
check_interval 5
retry_interval 1
active_checks_enabled 1
passive_checks_enabled 1
check_period xi_timeperiod_24x7
parallelize_check 1
obsess_over_service 1
check_freshness 0
freshness_threshold 1800
event_handler host-notify-by-email
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_interval 60
first_notification_delay 60
notification_period xi_timeperiod_24x7
notifications_enabled 1
contacts nagiosadmin
contact_groups admins, Techs
failure_prediction_enabled 1
register 0
}
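The three definitions above form a chain through the "use" directive: the Video Capture service uses xiwizard_linuxsnmp_process, which in turn uses xiwizard_generic_service. A minimal Python sketch of how the effective values might be resolved is shown here. It is a simplified model, not Nagios code: it assumes a single parent per object, copies only a few directives from the posted config, and the resolve helper is hypothetical.

```python
# Simplified model of Nagios template inheritance via "use".
# Assumption: one parent per object; real Nagios has more rules than this.
templates = {
    "xiwizard_generic_service": {
        "max_check_attempts": 5,
        "check_interval": 5,
        "retry_interval": 1,
    },
    "xiwizard_linuxsnmp_process": {
        "use": "xiwizard_generic_service",
    },
}

video_capture = {
    "service_description": "Video Capture",
    "use": "xiwizard_linuxsnmp_process",
}


def resolve(obj, templates):
    """Return obj with directives from its template chain filled in."""
    merged = {}
    parent = obj.get("use")
    if parent is not None:
        # Start from the parent's resolved values, then let the child override.
        merged.update(resolve(templates[parent], templates))
    merged.update({k: v for k, v in obj.items() if k != "use"})
    return merged


effective = resolve(video_capture, templates)
print(effective["max_check_attempts"], effective["retry_interval"])
```

Under this model the Video Capture service ends up with max_check_attempts 5 and retry_interval 1, even though neither appears in its own definition.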
Code:
define command {
command_name check_xi_service_snmp_linux_process
command_line $USER1$/check_snmp_process_wizard.pl -H $HOSTADDRESS$ $ARG1$
}
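To see what actually gets executed, it can help to expand the macros by hand. The Python sketch below splits the posted check_command on "!" and substitutes the pieces into the posted command_line. The expand helper is hypothetical, and the $USER1$ path and host address are placeholder assumptions, since neither appears in this thread.

```python
# Rough model of Nagios macro substitution for the posted definitions.
# Assumptions: the USER1 path and HOSTADDRESS below are placeholders.
check_command = (
    "check_xi_service_snmp_linux_process!"
    "-C roadway --v2c -n 'ves_cap_trx' -w0,2 -t 60!!!!!!!"
)
command_line = "$USER1$/check_snmp_process_wizard.pl -H $HOSTADDRESS$ $ARG1$"


def expand(check_command, command_line, macros):
    """Split check_command on '!' and fill $ARGn$ plus the given macros."""
    name, *args = check_command.split("!")
    values = dict(macros)
    for i, arg in enumerate(args, start=1):
        values[f"ARG{i}"] = arg
    for key, value in values.items():
        command_line = command_line.replace(f"${key}$", value)
    return name, command_line


name, line = expand(
    check_command,
    command_line,
    {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.0.2.10"},
)
print(name)
print(line)
```

Everything between the first and second "!" lands in $ARG1$, so the -t 60 timeout is passed through to the plugin; the trailing "!" characters just produce empty $ARG2$ and later arguments.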
Re: check_snmp_process_wizard.pl lag?
Posted: Tue Mar 19, 2013 3:28 pm
by scottwilkerson
jbennett wrote: I'm wondering if this is a known issue and if so, can I run a different, better optimized check instead?
Everything looks correct, but because this is an SNMP check it uses UDP, which is stateless, so packets can get dropped. This is likely what is happening.
With the config you posted, though, it shouldn't be sending notifications if the service is only down for 30 seconds; it should try 5 times at 1-minute intervals before sending a notification.
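The retry behaviour described here can be modelled roughly in Python. This is only a sketch of the soft/hard state idea using max_check_attempts 5 from the posted template; the classify function is hypothetical and ignores scheduling, flap detection, and first_notification_delay.

```python
# Rough model of soft vs. hard problem states.
# Assumption: each list entry is one check result (True = OK, False = problem),
# with failed checks retried until max_check_attempts is reached.
def classify(results, max_check_attempts=5):
    """Label each check result as OK, SOFT, or HARD."""
    attempt = 0
    states = []
    for ok in results:
        if ok:
            attempt = 0  # a recovery resets the attempt counter
            states.append("OK")
        else:
            attempt = min(attempt + 1, max_check_attempts)
            states.append("HARD" if attempt == max_check_attempts else "SOFT")
    return states


# A single dropped reply followed by a good retry never reaches HARD.
print(classify([False, True]))

# Five consecutive failures are needed before the state becomes HARD.
print(classify([False] * 5))
```

In this model a single lost UDP reply produces one SOFT result that clears on the next retry, which matches the expectation that a 30-second blip should not trigger a notification.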