Nagios Support Forum

Posted: **Wed Oct 05, 2016 8:26 am**

Hi,
I've noticed that I'm getting these from time to time:

error: Alarm signal (Nagios time-out)

on SNMP service checks.

In the Nagios Core Configuration I changed "service_check_timeout":

service_check_timeout=120
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s

And corrected the file:
/usr/local/nagios/libexec/check_snmp_process_wizard.pl
From

my $TIMEOUT = 15;

To:

my $TIMEOUT;

according to this thread:
https://support.nagios.com/forum/viewto ... d=b43289bb

This has solved some of the problems: the errors don't appear as often, but they are still there. I think this started when I updated some components a few weeks ago, but I'm not 100% sure.
Can we investigate this somehow? I've attached two screenshots from Nagios Status for a start.

Posted: **Wed Oct 05, 2016 1:05 pm**

Can you run the following and post the results so we can get the version of the plugin you are running?

/usr/local/nagios/libexec/check_snmp_process_wizard.pl -V

Also, you can edit the command and increase the timeout to 60 seconds by using the -t option.

-t, --timeout=INTEGER
timeout for SNMP in seconds

Posted: **Wed Oct 05, 2016 1:09 pm**

Are the problems specifically with the check_snmp_process_wizard.pl plugin, or does it happen on a variety of different SNMP checks?

The reason I ask is that it may just be taking _that_long_ to receive information back from your device.

Do you have the check running with -t 120? Another thing that might help is increasing service_check_timeout to 180.
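Before raising the global value, it can help to confirm which one is actually in effect. A minimal sketch, assuming the main config sits at /usr/local/nagios/etc/nagios.cfg (the usual source-install path, not confirmed in this thread); `cfg_value` is a hypothetical helper name, not part of Nagios:

```shell
#!/bin/sh
# Print the value of a key=value directive from a Nagios-style config file.
# cfg_value is a hypothetical helper, not part of Nagios.
cfg_value() {
    # $1 = directive name, $2 = config file
    awk -F= -v key="$1" '$1 == key { print $2; exit }' "$2"
}

# Assumed default location of the main config file; adjust for your install.
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

if [ -r "$NAGIOS_CFG" ]; then
    echo "service_check_timeout=$(cfg_value service_check_timeout "$NAGIOS_CFG")"
fi
```

Commented-out directives (lines starting with #) are skipped because the key has to match the text before the = exactly.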

Posted: **Wed Oct 05, 2016 2:58 pm**

Yeah, most of them are check_xi_service_snmp_win_process.
I would even say all of them; as far as I've noticed, I'm not having this problem with switches and other SNMP devices.

sudo /usr/local/nagios/libexec/check_snmp_process_wizard.pl -V
check_snmp_process version : 1.10

I would consider changing the timeout from 120 to 180 (as you can see, I globally changed it to 120 from the default 60), but I'm wondering what happened that all of a sudden I would need to triple the value.
I have not added a noticeable number of hosts and checks recently. Now we have ~230 hosts and ~630 services; before the problems started, maybe ~200 hosts and ~580 services.
I also see that I have 16 GB RAM and 4 CPUs (AWS m4.xlarge) and only ~9% free, as you can see in the screenshots. Is this something worth investigating?

Posted: **Wed Oct 05, 2016 4:28 pm**

Zip up the "check_snmp_process_wizard.pl" plugin that you are using, and upload it on the forum.

When you time the check, does the actual timeout match the value passed via the "-t" flag?

Example:

time /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H x.x.x.x  -C wrong-community-string --v2c -n httpd -t 50
ERROR: Alarm signal (Nagios time-out)

real    0m50.039s
user    0m0.040s
sys     0m0.007s

What is the output of the following command?

top | head -5

Posted: **Wed Oct 05, 2016 5:06 pm**

Zipped file attached.
This is what I get when I try one of the services that had the timeout a few hours ago (input sanitized, of course). Too bad I don't have any services with the timeout error at the moment.

time /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H OURHOST.domain.com -C our_real_comm_string --v2c -n 'xxx-service-9.1.4.0.0-x86.exe' -t 50
1 process matching xxx-service-9.1.4.0.0-x86.exe (> 0)

real    0m0.153s
user    0m0.074s
sys     0m0.022s

And the top of the top output is below:

top - 22:02:13 up 16 days,  6:58,  1 user,  load average: 0.53, 0.37, 0.31
Tasks: 165 total,   1 running, 164 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.8%us,  0.9%sy,  0.0%ni, 96.1%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  15371860k total, 15127388k used,   244472k free,   151292k buffers
Swap:   262136k total,        0k used,   262136k free,  1040312k cached
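For reference, the ~9% free figure mentioned earlier in the thread appears consistent with counting buffers and cache as reclaimable memory, since in this top output "used" is simply total minus free. A small sketch of that arithmetic with the numbers above; `reclaimable_pct` is a hypothetical helper name:

```shell
#!/bin/sh
# Share of RAM that is free or reclaimable (free + buffers + cached),
# using the kB figures from the top output above.
# reclaimable_pct is a hypothetical helper name.
reclaimable_pct() {
    # $1 = total, $2 = free, $3 = buffers, $4 = cached (all in kB)
    awk -v t="$1" -v f="$2" -v b="$3" -v c="$4" \
        'BEGIN { printf "%.1f\n", (f + b + c) / t * 100 }'
}

reclaimable_pct 15371860 244472 151292 1040312   # prints 9.3 (free + buffers + cached)
reclaimable_pct 15371860 244472 0 0              # prints 1.6 (strictly free only)
```

Read this way, most of the "used" memory is page cache that the kernel can give back on demand, and swap usage is 0k.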

Posted: **Thu Oct 06, 2016 9:51 am**

The "check_snmp_process_wizard.pl" plugin is almost identical to the one I am using. See the output of the diff below:

[root@main-nagios-xi tmp]# diff check_snmp_process_wizard.pl /usr/local/nagios/libexec/check_snmp_process_wizard.pl
27,28d26
< # my $TIMEOUT = 15  hardcoded timeout is a bad idea
<
334,335c332,333
<   verb("no timeout defined : $o_timeout + 10");
<   alarm ($o_timeout+10);
---
>   verb("no timeout defined : $o_timeout");
>   alarm($o_timeout);

The only difference is that we removed the "extra" 10 sec. that used to be added to the timeout. Now, the real timeout should match the value passed via the "-t" flag.

What happens if you time your check again using a wrong community string with "-t 50"? I would guess that the plugin will time out after 60 sec.
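Going by the diff above, the older copy arms alarm($o_timeout + 10), so the expected wall-clock time is the -t value plus 10. A rough sketch for checking that; `elapsed_secs` and `effective_alarm` are hypothetical helper names, and `date +%s` only gives whole-second resolution:

```shell
#!/bin/sh
# Run a command and print how many whole seconds it took (wall clock).
# elapsed_secs is a hypothetical helper; the command's own output is discarded.
elapsed_secs() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Effective alarm for the older plugin copy in the diff: -t value plus 10.
effective_alarm() {
    echo $(($1 + 10))
}

effective_alarm 50    # prints 60
```

Running `elapsed_secs` in front of the same check command used earlier (wrong community string, -t 50) should print roughly 50 for the updated plugin and roughly 60 for the older copy.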

FYI, you would see the "ERROR: Alarm signal (Nagios time-out)" error if for some reason snmpd is NOT running on the client. Other possible issues could be firewall problems, very high load on the client, etc.

Posted: **Fri Oct 07, 2016 8:31 am**

OK, giving a wrong community string ended up with:

ERROR: Alarm signal (Nagios time-out)

real    1m0.081s
user    0m0.062s
sys     0m0.021s

just like you presumed.
I would not blame a firewall, as the problem is very irregular and we have no firewalls between our "sites" (MPLS).
The most important thing is that I'm getting this on different devices, including devices where I did not have these issues before.
Is there an option to log these messages? Maybe that would be a good start for ruling out client-side problems or, if the problems are client-side, narrowing them down.
What do you think about the top result?
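On the logging question, the timeouts should already show up as SERVICE ALERT entries in the main Nagios log, so one option is to count them per host and service. A minimal sketch, assuming the log is at /usr/local/nagios/var/nagios.log (the usual source-install path, not confirmed in this thread) and uses the standard "[epoch] SERVICE ALERT: host;service;state;type;attempt;output" layout; `timeout_counts` is a hypothetical helper name:

```shell
#!/bin/sh
# Count "Alarm signal (Nagios time-out)" service alerts per host/service.
# timeout_counts is a hypothetical helper; it reads Nagios log lines on stdin.
timeout_counts() {
    grep -F 'SERVICE ALERT:' |
    grep -F 'Alarm signal (Nagios time-out)' |
    # Drop the "[epoch] SERVICE ALERT: " prefix, keep "host;service".
    sed 's/^\[[0-9]*\] SERVICE ALERT: //' |
    cut -d';' -f1,2 |
    sort | uniq -c | sort -rn
}

# Assumed default log location for a source install; adjust as needed.
NAGIOS_LOG="${NAGIOS_LOG:-/usr/local/nagios/var/nagios.log}"
if [ -r "$NAGIOS_LOG" ]; then
    timeout_counts < "$NAGIOS_LOG"
fi
```

If the counts cluster on a handful of hosts, that points toward the client side; an even spread across many hosts would point back at the Nagios server or the network.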

Posted: **Fri Oct 07, 2016 1:12 pm**

Can you look at the performance graph for the host check and see whether the ping rates are unusually high (5 seconds) around the time the service check times out, and whether there is a correlation?

Posted: **Sat Oct 08, 2016 7:52 am**

Sure, but this will take time, as I don't know which service will be next...
BTW, when a timeout appears, it is mostly only one service, even though we are checking, for example, 7 things via SNMP on the same host.
Shouldn't everything be failing at the same time?

SNMP - error: Alarm signal (Nagios time-out)
