CRITICAL - Socket timeout after 10 seconds

jer1982
Posts: 6
Joined: Wed Aug 08, 2012 4:58 am

Post by jer1982 »

Hi everyone,

Need a little help.

Getting a CRITICAL - Socket timeout after 10 seconds error on all my services intermittently. Seems to happen every 10-12 hours or so.

Nagios Core is installed on CentOS 5.8, using only passive checks for Windows servers and services.

Think the problem is probably related to NSCA on the CentOS machine as the fix for this is:

killall nsca
killall nsca
/usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg

That kills the two nsca processes and then starts the daemon again (I think). I discovered this through trial and error while trying to get it sorted.

Not sure if it's related to the fact that nsca seems to be running twice?

My 2 questions are:

1) Does anyone have any ideas on a permanent fix for this?

2) Could I have a cron job that runs killall nsca twice and then /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg every 10 minutes or so? That should keep things running, as running those commands while it's working doesn't seem to have any effect on Nagios.

I tried putting the following in my /etc/crontab:

# kill duplicate NSCA PIDs
*/10 * * * * killall nsca
*/10 * * * * killall nsca
*/10 * * * * /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg

Which I think should run those commands every 10 mins, but not sure if that will work?

Post by jer1982 »

Checking in the NSCA .log file of the client machines sometimes reveals this:

2012-08-09 05:55:59: error:modules\NSCAAgent\NSCAThread.cpp:286: <<< Could not connect to: 46.20.226.250:5667 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2012-08-09 06:28:08: error:modules\NSCAAgent\NSCAThread.cpp:312: Timeout reading NSCA hdr packet (increase socket_timeout), we only got: 0

I still think the issue is on the Nagios server end, though, as the clients all drop at exactly the same time, and killing and restarting NSCA on the Nagios box kicks it back to life.

Post by jer1982 »

Still having this issue if anyone has any bright ideas.

I used the cronjob and it's been a little better since then but every now and then it still drops out with the same error.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Post by slansing »

Strange. Does this happen during any routine automated system maintenance? Is it at roughly the same time every time?

You could create a cron job to restart NSCA; that is one idea. But let's see if you can dig up any information about things happening at the same time. Have you tried tailing the Nagios logs when this happens?
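
If you do go the cron route, here is a rough sketch of a single restart script. It is only an illustration: the name restart_daemon, the script path in the comment, and the 2 second pause are all assumptions. The point is that the kill and the start happen in order inside one job, because separate crontab lines on the same schedule are launched independently of each other.

```shell
#!/bin/sh
# Hypothetical restart helper: stop every running copy, pause, start one copy.
# Example /etc/crontab entry (the "root" user field and script path are assumptions):
#   */10 * * * * root /usr/local/bin/restart_nsca.sh

# restart_daemon NAME COMMAND...: kill all processes called NAME, then run COMMAND.
restart_daemon() {
    name=$1
    shift
    # killall exits non-zero when nothing matches; ignore that case.
    killall "$name" 2>/dev/null || true
    # Arbitrary pause so the old process can exit and release its port.
    sleep 2
    "$@"
}
```

It would be called as: restart_daemon nsca /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg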

Post by jer1982 »

No pattern that I can see - sometimes it's overnight or on the weekend, sometimes it's in the middle of the working day.

I've implemented the cronjob which restarts NSCA and that has seen an improvement but it still drops out every now and then.

I have tailed the log, and basically all input stops. The log just freezes as though it's not receiving any information, and nothing starts again until NSCA is killed and restarted, so it definitely points to an NSCA issue.

Post by slansing »

Are two NSCA instances spawning when it locks up?
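
A quick way to check, as a minimal sketch: list the running process names and count the exact matches for nsca. The helper name count_named is made up for illustration; it just reads one process name per line on standard input.

```shell
#!/bin/sh
# count_named NAME: read one process name per line on stdin and print
# how many lines match NAME exactly.
count_named() {
    # -c prints the number of matching lines, -x requires a whole-line match,
    # -F treats NAME as a fixed string. grep exits non-zero when the count
    # is 0, so "|| true" keeps the function's exit status clean.
    grep -c -x -F -- "$1" || true
}

# Typical use on the Nagios server (ps prints one command name per process):
#   ps -e -o comm= | count_named nsca
```

Anything above 1 when it locks up would suggest that extra copies are being started.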

Post by jer1982 »

Seems that way - I have to killall nsca twice

Post by slansing »

Try removing:

*/10 * * * * /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg

from your cron setup.

Post by jer1982 »

I think if that line doesn't run then NSCA doesn't start at all and my passive checks aren't picked up. Tailing the log without having run it leaves the log basically empty, with nothing being received.

Post by slansing »

If you start NSCA, it only needs to start once. You have it set to start every 10 minutes, which is why you are having the issue with multiple instances running. NSCA should start automatically by itself.
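
If you want a safety net anyway, a rough sketch of "start it only when it is missing" could look like the following. ensure_running is a hypothetical helper, not something shipped with NSCA; it relies on pgrep -x, which matches the exact process name.

```shell
#!/bin/sh
# ensure_running NAME COMMAND...: run COMMAND only when no process called
# NAME is found; otherwise leave the existing instance alone.
ensure_running() {
    name=$1
    shift
    if pgrep -x "$name" >/dev/null 2>&1; then
        echo "$name already running, not starting another copy"
    else
        echo "starting $name"
        "$@"
    fi
}
```

With the paths from earlier in the thread, that would be called as: ensure_running nsca /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg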