Service notification timed out after 30 seconds

pilotmc
Posts: 21
Joined: Tue May 23, 2017 3:33 pm

Service notification timed out after 30 seconds

Post by pilotmc »

Hello, all.

I have a slack notification script which makes a curl call to an API endpoint to pass on alert information to Slack.
It worked until about two weeks ago, but now it gives me this error:

[1495568715] Warning: Contact 'slack' service notification command '/usr/local/bin/slack_nagios.sh > /tmp/slack.log 2>&1' timed out after 30 seconds

I've made a test curl call with fake alert data that I use to test the service, and it always works and completes within moments.
However, Nagios can't seem to do whatever it does within 30 seconds.

The problem is, how do I troubleshoot this? We have made some significant LAN changes which required me to re-IP all our LAN nodes, and therefore update the configs for all the monitoring. There was a week where there was no NAT entry for the Nagios server, so it was unable to reach the Slack API. That is resolved, but the timeouts persist.

The only other local change is I've added a couple more node configs for new servers that need monitoring. I did notice that one I added never showed up in Nagios at all. I removed that definition.

There ARE alerts being generated. If I go to the Nagios web page I can see where Critical alerts are being generated. My contact group has both root email and the slack notification, and root is receiving the email notifications just fine.

Things I've tried:

1) nagios3 -v /etc/nagios3/nagios.cfg returns clean.
2) Restarted the nagios service
3) restarted the server on which nagios is installed.
4) made sure NRPE calls work to my nodes (I use mostly NRPE)
5) Turned on debugging at level 48, which should cover notifications and service errors (I think?)

One strange thing: I did find one config where I had an unused host (i.e. no service was configured for it). When I removed that host definition, the Slack alerts worked for a while. I then added a couple of new nodes and it broke again. Oddly, there was no error in nagios3 -v or in the logs.

The slack log shows a 29-second attempt to curl the data:
root@monitor:/var/log/nagios3# cat /tmp/slack.log
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:29 --:--:--     0root@monitor:/var/log/nagios3#

So it kind of appears to be a true timeout issue, but I've manually done the curl call as both root and the nagios user, and it always works immediately, so it's hard to use that for troubleshooting.
What other techniques can I use to get a better picture of the call chain and what data is actually being passed from the notification service to the Slack script?

Thanks.
#Mikec
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Service notification timed out after 30 seconds

Post by tgriep »

If you run the script in a shell, how long does it take to run?
If it is around 30 seconds consistently, you may want to edit the nagios.cfg file and increase the notification timeout from
notification_timeout=30
to
notification_timeout=60
Save the file and restart Nagios.

If the server / hosts had to have their IP addresses changed, it could be a routing or DNS issue that is causing the delay. Verify that the DNS server that the Nagios server uses has valid entries and see if that is the cause of the timeouts.
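
One way to tell a DNS delay apart from a routing delay is to have curl report how long each phase of the request takes. The following is only a sketch: the function name curl_phase_times and the placeholder SLACK_URL are assumptions for illustration, not part of the script discussed in this thread.

```shell
#!/bin/sh
# SLACK_URL is a placeholder (assumption), not the real endpoint from the thread.
SLACK_URL="${SLACK_URL:-https://hooks.slack.com/}"

# Print the time spent in each phase of a request to the URL given as $1.
curl_phase_times() {
    # -s hides the progress meter, -o /dev/null discards the response body,
    # --max-time 25 stops the probe before the 30 second Nagios timeout.
    curl -s -o /dev/null --max-time 25 \
        -w 'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} total=%{time_total}\n' \
        "$1"
}
```

Calling curl_phase_times "$SLACK_URL" as the nagios user should show where the time goes: a large dns value points at name resolution, while a small dns value followed by a stalled connect points at routing, NAT or a firewall.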
Be sure to check out our Knowledgebase for helpful articles and solutions!
pilotmc
Posts: 21
Joined: Tue May 23, 2017 3:33 pm

Re: Service notification timed out after 30 seconds

Post by pilotmc »

Thanks for the reply.
Running the script is instantaneous from the command line.
The script is in /usr/local/bin, and the nagios user can execute it.
DNS query on the hostname does return the right IP address.

The script dumps its output to a log file in /tmp, which looks like a connection attempt that is timing out:

pilotmc@monitor:/tmp$ cat slack.log 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:29 --:--:--     0
pilotmc@monitor:/tmp$ 
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Service notification timed out after 30 seconds

Post by tgriep »

One thing to check: if the script uses environment variables for the curl command, make sure they are being passed to it correctly.
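
A rough way to check this is to record the environment the script sees when Nagios runs it and compare it with the environment of an interactive shell. The sketch below is an assumption-laden example: dump_env and the file paths in the usage note are hypothetical names, not part of slack_nagios.sh.

```shell
#!/bin/sh
# Hypothetical helper: write the current environment to the file given as $1.
dump_env() {
    # Sort the output so two dumps can be compared line by line with diff.
    env | sort > "$1"
}
```

For example, calling dump_env /tmp/env.nagios at the top of the notification script and dump_env /tmp/env.shell from a login shell, then running diff /tmp/env.nagios /tmp/env.shell, would show variables such as PATH or proxy settings that exist in one context but not the other. If the script reads NAGIOS_* macro variables, note that Nagios only exports them when enable_environment_macros=1 is set in nagios.cfg.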