SNMP - error: Alarm signal (Nagios time-out)

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
jacek
Posts: 255
Joined: Wed Sep 09, 2015 5:49 am

SNMP - error: Alarm signal (Nagios time-out)

Post by jacek »

Hi,
I've noticed that I get the following from time to time:
error: Alarm signal (Nagios time-out)
on SNMP service checks.

In the Nagios Core Configuration I changed the "service_check_timeout"

Code: Select all

service_check_timeout=120
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
And corrected the file:
/usr/local/nagios/libexec/check_snmp_process_wizard.pl
From
my $TIMEOUT = 15;
To:
my $TIMEOUT;
according to this thread:
https://support.nagios.com/forum/viewto ... d=b43289bb

This has solved some of the problems: the errors don't appear as often, but they are still there. I think this started when I updated some components a few weeks ago, but I'm not 100% sure.
Can we investigate this somehow? I've attached two screenshots from Nagios Status for a start.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: SNMP - error: Alarm signal (Nagios time-out)

Post by tgriep »

Can you run the following and post the results so we can see the version of the plugin you are running?

Code: Select all

/usr/local/nagios/libexec/check_snmp_process_wizard.pl -V
Also, you can edit the command and increase the timeout to 60 seconds by using the -t option.
-t, --timeout=INTEGER
timeout for SNMP in seconds
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: SNMP - error: Alarm signal (Nagios time-out)

Post by rkennedy »

Are the problems specifically with the check_snmp_process_wizard.pl plugin, or does it happen on a variety of different SNMP checks?

The reason I ask is that it may just be taking _that_long_ to receive information back from your device.

Do you have the check running with -t 120? Another thing that might help is increasing service_check_timeout to 180.
Former Nagios Employee
jacek
Posts: 255
Joined: Wed Sep 09, 2015 5:49 am

Re: SNMP - error: Alarm signal (Nagios time-out)

Post by jacek »

Yeah, most of them are check_xi_service_snmp_win_process.
I would even say all of them; I'm not having this problem with switches and other SNMP devices, as far as I've noticed.

Code: Select all

sudo /usr/local/nagios/libexec/check_snmp_process_wizard.pl -V
check_snmp_process version : 1.10
I would consider changing the timeout from 120 to 180 (as you can see, I globally changed it to 120 from the default 60), but I'm wondering what happened that all of a sudden I would need to triple the value.
I did not add a noticeable number of hosts and checks recently. Now we have ~230 hosts and ~630 services; before the problems it was maybe ~200 hosts and ~580 services.
I also notice that I have 16 GB RAM and 4 CPUs (AWS m4.xlarge) and have ~9% free, as you can see in the screenshots. Is this something worth investigating?
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: SNMP - error: Alarm signal (Nagios time-out)

Post by lmiltchev »

Zip up the "check_snmp_process_wizard.pl" plugin that you are using, and upload it on the forum.

When you time the check, does the actual timeout match the value passed via the "-t" flag?

Example:

Code: Select all

time /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H x.x.x.x  -C wrong-community-string --v2c -n httpd -t 50
ERROR: Alarm signal (Nagios time-out)

real    0m50.039s
user    0m0.040s
sys     0m0.007s
What is the output of the following command?

Code: Select all

top | head -5
jacek
Posts: 255
Joined: Wed Sep 09, 2015 5:49 am

Re: SNMP - error: Alarm signal (Nagios time-out)

Post by jacek »

Zipped file attached.
This is what I get when I try one of the services that had the timeout a few hours ago (with the input sanitized, of course). Too bad I don't have any services with the timeout error at this moment.

Code: Select all

time /usr/local/nagios/libexec/check_snmp_process_wizard.pl -H OURHOST.domain.com -C our_real_comm_string --v2c -n 'xxx-service-9.1.4.0.0-x86.exe' -t 50
1 process matching xxx-service-9.1.4.0.0-x86.exe (> 0)

real    0m0.153s
user    0m0.074s
sys     0m0.022s
And the first lines of the top output are below:

Code: Select all

top - 22:02:13 up 16 days,  6:58,  1 user,  load average: 0.53, 0.37, 0.31
Tasks: 165 total,   1 running, 164 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.8%us,  0.9%sy,  0.0%ni, 96.1%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  15371860k total, 15127388k used,   244472k free,   151292k buffers
Swap:   262136k total,        0k used,   262136k free,  1040312k cached
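For the top output above, a quick calculation may help put the memory numbers in context. The sketch below only reuses the values from the Mem: and Swap: lines. Treating free + buffers + cached as readily reclaimable memory is a common reading of this older top layout and is an assumption here, not something stated in the thread; the helper name pct is made up for illustration.

```shell
#!/bin/sh
# Values copied from the Mem:/Swap: lines of the top output above (in k).
total=15371860
free=244472
buffers=151292
cached=1040312

# Assumption: buffers and page cache can be reclaimed on demand,
# so they are counted together with strictly free memory.
reclaimable=$((free + buffers + cached))

# pct PART WHOLE: print PART as a percentage of WHOLE, one decimal place.
pct() {
    awk -v p="$1" -v w="$2" 'BEGIN { printf "%.1f", 100 * p / w }'
}

echo "strictly free:           $(pct "$free" "$total")%"
echo "free + buffers + cached: $(pct "$reclaimable" "$total")%"
```

Under that assumption about 9.3% of RAM is readily available, which lines up with the ~9% free figure mentioned earlier in the thread, while only about 1.6% is strictly free.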
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: SNMP - error: Alarm signal (Nagios time-out)

Post by lmiltchev »

The "check_snmp_process_wizard.pl" plugin is almost identical to the one I am using. See the output of the diff below:

Code: Select all

[root@main-nagios-xi tmp]# diff check_snmp_process_wizard.pl /usr/local/nagios/libexec/check_snmp_process_wizard.pl
27,28d26
< # my $TIMEOUT = 15  hardcoded timeout is a bad idea
<
334,335c332,333
<   verb("no timeout defined : $o_timeout + 10");
<   alarm ($o_timeout+10);
---
>   verb("no timeout defined : $o_timeout");
>   alarm($o_timeout);
The only difference is that we removed the "extra" 10 sec. that used to be added to the timeout. Now, the real timeout should match the value passed via the "-t" flag.

What happens if you time your check again using a wrong community string with "-t 50"? Since your copy still adds the extra 10 sec., I would guess that the plugin will time out after 60 sec.

FYI, you would see the "ERROR: Alarm signal (Nagios time-out)" error if for some reason snmpd is NOT running on the client. Other possible issues could be firewall problems, very high load on the client, etc.
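The 10-second difference described above can be illustrated with a rough shell analogue. This is only a sketch: it uses GNU coreutils timeout as a stand-in for the Perl alarm() call shown in the diff, `sleep 5` as a stand-in for a check that never gets an SNMP reply, and the function name effective_timeout is hypothetical.

```shell
#!/bin/sh
# effective_timeout T EXTRA: seconds before the watchdog fires.
effective_timeout() {
    echo $(( $1 + $2 ))
}

# Copy with alarm($o_timeout+10) versus copy with alarm($o_timeout).
old=$(effective_timeout 50 10)
new=$(effective_timeout 50 0)
echo "with -t 50: +10 copy gives up after ${old}s, other copy after ${new}s"

# Stand-in for an unanswered check: the 1-second watchdog kills `sleep 5`.
# GNU timeout reports a fired watchdog with exit status 124.
status=0
timeout 1 sleep 5 || status=$?
if [ "$status" -eq 124 ]; then
    echo "watchdog fired (analogous to the Alarm signal message)"
fi
```

With the copy that still adds 10 seconds, a check run with -t 50 would therefore be expected to give up after about 60 seconds rather than 50.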
jacek
Posts: 255
Joined: Wed Sep 09, 2015 5:49 am

Re: SNMP - error: Alarm signal (Nagios time-out)

Post by jacek »

OK, giving a wrong community string ended up with:

Code: Select all

ERROR: Alarm signal (Nagios time-out)

real    1m0.081s
user    0m0.062s
sys     0m0.021s
just as you presumed.
I would not blame a firewall, as the problem is very irregular and we have no firewalls between our "sites" (MPLS).
And the most important thing is that I'm getting this on different devices, including devices where I did not have these issues before.
Is there an option to log these messages? Maybe that would be a good start for ruling out client-side problems or, if the problems are on the client side, for narrowing them down.
What do you think about the top output?
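No built-in logging option comes up in this thread. One possible approach, assuming it is acceptable to wrap the plugin call, is a small wrapper that passes the check's output and exit status through unchanged and appends a line to a log file whenever the output contains "Alarm signal". Everything named here (log_check, the log file argument, the label) is hypothetical and not part of Nagios or the plugin; `date +%s` is assumed to be available, as it is on GNU and BSD systems.

```shell
#!/bin/sh
# log_check LOGFILE LABEL COMMAND [ARGS...]
# Runs COMMAND, prints its output, returns its exit status, and appends a
# timestamped line to LOGFILE when the output mentions "Alarm signal".
# Only LABEL is logged, so the community string never reaches the log.
log_check() {
    logfile=$1
    label=$2
    shift 2
    start=$(date +%s)
    rc=0
    output=$("$@" 2>&1) || rc=$?
    elapsed=$(( $(date +%s) - start ))
    case $output in
        *"Alarm signal"*)
            printf '%s label=%s rc=%s elapsed=%ss\n' \
                "$(date '+%Y-%m-%d %H:%M:%S')" "$label" "$rc" "$elapsed" \
                >> "$logfile"
            ;;
    esac
    printf '%s\n' "$output"
    return "$rc"
}
```

In real use, COMMAND would be the plugin invocation shown earlier in the thread, and LABEL could be the host or service name. The logged elapsed time could then be compared with the -t value and with host ping times around the same moment.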
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: SNMP - error: Alarm signal (Nagios time-out)

Post by tgriep »

Can you look at the performance graph for the host check and see whether the ping times are unusually high (5 seconds) around the time the service check times out, and whether there is a correlation?
jacek
Posts: 255
Joined: Wed Sep 09, 2015 5:49 am

Re: SNMP - error: Alarm signal (Nagios time-out)

Post by jacek »

Sure, but this will take time, as I don't know which service will be next...
BTW, when a timeout appears, it is mostly only one service, even though we are checking, for example, 7 things via SNMP on the same host.
Shouldn't everything be failing at the same time?