CRITICAL - Socket timeout after 10 seconds

Posted: Wed Aug 08, 2012 5:07 am
by jer1982
Hi everyone,

Need a little help.

Getting a CRITICAL - Socket timeout after 10 seconds error on all my services intermittently. Seems to happen every 10-12 hours or so.

Nagios core installed on CentOS 5.8 - using only passive checks on Windows servers and services.

Think the problem is probably related to NSCA on the CentOS machine as the fix for this is:

killall nsca
killall nsca
/usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg

This kills the 2 nsca processes and then restarts NSCA (I think). I discovered this through trial and error while messing around trying to get it sorted.

I'm not sure whether the problem is related to the fact that nsca is running twice.

My 2 questions are:

1). Does anyone have any ideas for a permanent fix for this?

2). Could I have a cron job that runs killall nsca twice and then /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg every 10 minutes or so? This should keep things running, as running those commands while everything is working doesn't seem to have any effect on Nagios.

I tried putting the following in my /etc/crontab:

# kill duplicate NSCA processes
*/10 * * * * killall nsca
*/10 * * * * killall nsca
*/10 * * * * /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg

I think this should run those commands every 10 minutes, but I'm not sure whether it will work.

Re: CRITICAL - Socket timeout after 10 seconds

Posted: Thu Aug 09, 2012 5:42 am
by jer1982
The NSCA log file on the client machines sometimes shows this:

2012-08-09 05:55:59: error:modules\NSCAAgent\NSCAThread.cpp:286: <<< Could not connect to: 46.20.226.250:5667 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2012-08-09 06:28:08: error:modules\NSCAAgent\NSCAThread.cpp:312: Timeout reading NSCA hdr packet (increase socket_timeout), we only got: 0

I definitely think the issue is on the Nagios server end, though, as all the clients drop at exactly the same time, and killing and restarting NSCA on the Nagios box kicks everything back to life.
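
One way to separate a network problem from a hung daemon is to test the NSCA port from another machine while the checks are failing. A minimal sketch, assuming bash is available (the /dev/tcp path is a bash feature, not a real file); port_open is a made-up helper name:

```shell
#!/bin/bash
# port_open HOST PORT: succeed if a TCP connection to HOST:PORT can be opened.
# Hypothetical helper; relies on bash's built-in /dev/tcp redirection.
# Note: there is no timeout here, so a firewalled host can block for a while.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Usage: ./port_check.sh HOST PORT   (e.g. the Nagios server and 5667)
if [ "$#" -eq 2 ]; then
    if port_open "$1" "$2"; then
        echo "port $2 on $1 accepts connections"
    else
        echo "port $2 on $1 is not reachable"
    fi
fi
```

If the port accepts connections but the clients still time out reading the NSCA header, that would point at the daemon itself rather than the network in between.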

Re: CRITICAL - Socket timeout after 10 seconds

Posted: Fri Oct 05, 2012 10:42 am
by jer1982
Still having this issue if anyone has any bright ideas.

I used the cronjob and it's been a little better since then but every now and then it still drops out with the same error.

Re: CRITICAL - Socket timeout after 10 seconds

Posted: Fri Oct 05, 2012 4:27 pm
by slansing
Strange, does this happen during any routine automated system maintenance? Is it at roughly the same time every time?

You could create a cron job to restart NSCA; that is one idea. But let's see if you can dig up any information about things happening at the same time. Have you tried tailing the Nagios logs when this happens?

Re: CRITICAL - Socket timeout after 10 seconds

Posted: Mon Oct 08, 2012 10:46 am
by jer1982
No pattern that I can see - sometimes it's overnight or on the weekend, sometimes it's in the middle of the working day.

I've implemented the cronjob which restarts NSCA and that has seen an improvement but it still drops out every now and then.

I have tailed the log and basically all input stops. The log just freezes as though it's not receiving any information, and nothing starts again until NSCA is killed and restarted, so it definitely points to an NSCA issue.

Re: CRITICAL - Socket timeout after 10 seconds

Posted: Mon Oct 08, 2012 4:08 pm
by slansing
Are two NSCA instances spawning when it locks up?
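
One way to check, as a sketch: list every process named nsca together with its parent PID and elapsed running time. list_procs is just a helper name made up for this example; the ps options are the procps ones found on CentOS.

```shell
#!/bin/sh
# List every process named exactly "$1" with its PID, parent PID, elapsed
# running time and full command line. Prints nothing if there is no match.
list_procs() {
    ps -C "$1" -o pid=,ppid=,etime=,args=
}

list_procs nsca || echo "no nsca process found"
```

Two entries that both have a parent PID of 1 would suggest two daemons that were started separately. An entry whose parent PID is another nsca PID is more likely a child forked by the daemon to handle a connection.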

Re: CRITICAL - Socket timeout after 10 seconds

Posted: Tue Oct 09, 2012 5:57 am
by jer1982
Seems that way - I have to killall nsca twice

Re: CRITICAL - Socket timeout after 10 seconds

Posted: Tue Oct 09, 2012 2:30 pm
by slansing
Try removing:

*/10 * * * * /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg

from your cron setup.

Re: CRITICAL - Socket timeout after 10 seconds

Posted: Wed Oct 10, 2012 4:29 am
by jer1982
I think if that line doesn't run then NSCA doesn't start at all and my passive checks aren't picked up. Tailing the log without having run it leaves the log basically empty, with nothing being received.

Re: CRITICAL - Socket timeout after 10 seconds

Posted: Wed Oct 10, 2012 1:12 pm
by slansing
NSCA only needs to be started once. You have it set to start every 10 minutes, which is why you are ending up with multiple instances running. NSCA should start automatically by itself.
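
If you still want cron as a safety net, a gentler option is a script that starts NSCA only when no instance is running, instead of killing and restarting it every time. A minimal sketch, using the paths from the first post; the function names and the "run" argument are made up for illustration:

```shell
#!/bin/sh
# Hypothetical watchdog: start NSCA only when no instance is already running.
# Default paths are the ones quoted in the thread; override via environment.
NSCA_BIN="${NSCA_BIN:-/usr/local/nagios/bin/nsca}"
NSCA_CFG="${NSCA_CFG:-/usr/local/nagios/etc/nsca.cfg}"

# Print the number of running processes whose name is exactly "$1".
count_procs() {
    pgrep -x "$1" 2>/dev/null | wc -l | tr -d ' '
}

# Start NSCA if, and only if, nothing named "nsca" is currently running.
ensure_nsca() {
    if [ "$(count_procs nsca)" -eq 0 ]; then
        "$NSCA_BIN" -c "$NSCA_CFG"
    fi
}

# Only act when invoked as: nsca-watchdog.sh run
case "${1:-}" in
    run) ensure_nsca ;;
esac
```

Saved as, say, /usr/local/bin/nsca-watchdog.sh (a hypothetical path), it could replace the three cron entries with a single one that calls it with the run argument. Keep in mind that /etc/crontab expects a user field before the command, and that this only covers the case where NSCA has died; it won't notice a daemon that is still running but hung.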