
Check_nrpe Socket timeout after 50 seconds.

Posted: Wed Mar 12, 2014 3:32 am
by quental
Good morning,
Several checks on my remote machine return this error:
Check_nrpe Socket timeout after 50 seconds.

At night, the monitoring network is under heavy load because the backups of the remote machine are performed then.
The volume of traffic is enormous and the network is saturated.

I changed the service_check_timeout to 600.
But the error continues.

On my remote machine, nrpe.cfg sets
command_timeout=90.

I have also scheduled a restart of the xinetd daemon, but the error still appears.

It is very important not to have this type of alert.
How can I solve it?

Thanks for your time.
Regards

Re: Check_nrpe Socket timeout after 50 seconds.

Posted: Wed Mar 12, 2014 9:52 am
by slansing
If this is due to your internal network as you say, there is nothing Nagios can do to fix that. You will need to give Nagios reliable routes that will not fail or time out. I would definitely not recommend using a timeout of 600: each check Nagios runs forks the Nagios process and keeps it open until the check returns, so you may be "overloading" your Nagios server.
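
A note on the timeout values mentioned so far: the 50 seconds in the error message most likely comes from the -t option passed to check_nrpe in the command definition, not from service_check_timeout or command_timeout. Whichever limit in the chain is smallest is the one that fires first. A minimal Python sketch of that reasoning follows. The function name and labels are made up for illustration, and the value 50 for -t is inferred from the error text rather than from any posted config:

```python
def first_timeout(check_nrpe_t, command_timeout, service_check_timeout):
    """Return (name, seconds) of the limit that is reached first.

    check_nrpe_t          -- value of -t on the check_nrpe command line (assumed)
    command_timeout       -- command_timeout in nrpe.cfg on the remote machine
    service_check_timeout -- service_check_timeout in nagios.cfg
    """
    limits = {
        "check_nrpe -t": check_nrpe_t,
        "nrpe.cfg command_timeout": command_timeout,
        "nagios.cfg service_check_timeout": service_check_timeout,
    }
    # The smallest limit expires first and decides which error is reported.
    name = min(limits, key=limits.get)
    return name, limits[name]


# Values from this thread: -t 50 (inferred), command_timeout 90, service_check_timeout 600
print(first_timeout(50, 90, 600))  # ('check_nrpe -t', 50)
```

With the values from this thread it reports the check_nrpe limit, which would explain why raising the other two did not change the message.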

Re: Check_nrpe Socket timeout after 50 seconds.

Posted: Wed Mar 12, 2014 11:47 am
by quental
Good afternoon,

The problem is that I cannot provide a separate monitoring network, and the machine is backed up every night.

Is there any possible solution?

I have set service_check_timeout=60.

Thanks for your time.
Regards

Re: Check_nrpe Socket timeout after 50 seconds.

Posted: Wed Mar 12, 2014 11:56 am
by slansing
I'd recommend using recurring/scheduled downtime to place all of the affected hosts/services into downtime while this happens. That way they will not alert you when their host address becomes unreachable. This can be done through Home > Recurring/Scheduled downtime.
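
If the downtime needs to be scripted rather than set up in the interface, Nagios Core also accepts a SCHEDULE_SVC_DOWNTIME external command. A rough Python sketch of building that command line follows. The host name, service name, author and the helper function itself are placeholders, and the location of the external command file depends on your installation, so writing to it is left out:

```python
import time


def schedule_svc_downtime(host, service, start, end, author, comment, now=None):
    """Build a SCHEDULE_SVC_DOWNTIME external command line.

    start and end are Unix timestamps. The downtime is fixed (fixed=1),
    has no trigger (trigger_id=0), and duration is simply end - start.
    """
    now = int(time.time()) if now is None else int(now)
    fixed = 1
    trigger_id = 0
    duration = end - start
    return (
        f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
        f"{start};{end};{fixed};{trigger_id};{duration};{author};{comment}"
    )


# Placeholder values: a one-hour window for a hypothetical service
print(schedule_svc_downtime("remote01", "CPU Load", 1000, 4600,
                            "admin", "nightly backup", now=900))
```

The resulting line would be appended to the Nagios external command file by whatever job runs before the backup starts.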

Re: Check_nrpe Socket timeout after 50 seconds.

Posted: Thu Mar 13, 2014 3:34 am
by quental
Good morning,
It is impossible to schedule downtime because monitoring must run 24 hours a day, and these are very important checks.

Does using NSCA eliminate the problem?

Thanks for your reply

Regards.

Re: Check_nrpe Socket timeout after 50 seconds.

Posted: Thu Mar 13, 2014 9:41 am
by slansing
Possibly. If you are not seeing the same issue with passive checks, you could use NRDP to passively send check results to XI. But hear this: no manner of change in XI, or on its agents, will fix network infrastructure issues for you. If you are seeing periods where your network just drops, you can expect Nagios to report exactly that, because that is what is happening. Without scheduling downtime or making a custom check time period, you need to expect that Nagios will tell you what is going on, as it is.
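
For anyone trying the passive route, this is roughly what a service result looks like when submitted to NRDP as XML. It is only a sketch of the payload in Python: the element names follow the NRDP check result format as I understand it, so verify them against your NRDP version. The host, service and output values are placeholders, and the NRDP URL and token needed to actually post it are not shown:

```python
import xml.etree.ElementTree as ET


def nrdp_service_result(host, service, state, output):
    """Build an NRDP-style XML payload for one passive service check result.

    state uses the usual plugin return codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    """
    root = ET.Element("checkresults")
    result = ET.SubElement(root, "checkresult", {"type": "service"})
    ET.SubElement(result, "hostname").text = host
    ET.SubElement(result, "servicename").text = service
    ET.SubElement(result, "state").text = str(state)
    ET.SubElement(result, "output").text = output
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only
print(nrdp_service_result("remote01", "CPU Load", 0, "OK - load average: 0.42"))
```

Keep in mind that the results still have to cross the same network, so a saturated link can delay passive results just as it delays active checks.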

Re: Check_nrpe Socket timeout after 50 seconds.

Posted: Wed Apr 02, 2014 1:43 am
by quental
Good morning,
At my company we have stopped the nightly backups that we thought were causing the timeouts, so the network is no longer saturated.

On the machines with the most checks, "check_nrpe Socket timeout after 50 seconds" continues to appear.

It is a very critical problem, since these timeouts show up in the alerts as false positives.

Could you please tell me how I can solve it?

Thanks for your time.
regards

Re: Check_nrpe Socket timeout after 50 seconds.

Posted: Wed Apr 02, 2014 11:05 am
by abrist
quental wrote: Could you please tell me how I can solve it?
Reduce network load. Or create a time period that excludes the time that the backups are running.
You cannot have it both ways though (working checks and the current backup network load). If your backups cause checks set to 600 seconds to time out, then there is nothing you can do to the Nagios server to compensate. This is an environment problem on your network. You either need more throughput, or you need to QoS/stagger the backups, or you need to exclude the timeperiod with downtime or exclusions.

EDIT: I missed the part about stopping the nightly backups.
1) So even without the backups, the checks still time out?
2) Do they only time out at specific times?
3) When do they recover and start working again?
4) Is there any other network maintenance taking place at the same time?
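
One way to answer questions 2 and 3 is to count the timeout alerts per hour of day from the Nagios log. A small Python sketch of that idea follows. It assumes the usual "[epoch] SERVICE ALERT: host;service;state;..." log line layout, and the sample lines are invented for illustration, not taken from a real log:

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches service alerts whose plugin output mentions a socket timeout
ALERT = re.compile(r"^\[(\d+)\] SERVICE ALERT: .*Socket timeout")


def timeouts_per_hour(lines, tz=timezone.utc):
    """Count socket-timeout alerts by hour of day (0-23) in the given timezone."""
    hours = Counter()
    for line in lines:
        match = ALERT.match(line)
        if match:
            stamp = datetime.fromtimestamp(int(match.group(1)), tz)
            hours[stamp.hour] += 1
    return hours


# Invented sample lines (timestamps are on 2014-04-01 UTC)
sample = [
    "[1396314000] SERVICE ALERT: remote01;CPU Load;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 50 seconds.",
    "[1396314600] SERVICE ALERT: remote01;Disk /;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 50 seconds.",
    "[1396317600] SERVICE ALERT: remote01;CPU Load;OK;SOFT;2;OK - load average: 0.10",
    "[1396360800] SERVICE ALERT: remote01;CPU Load;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 50 seconds.",
]

for hour, count in sorted(timeouts_per_hour(sample).items()):
    print(f"{hour:02d}:00  {count}")
```

If the counts cluster around a few hours, that points to something scheduled. If they are spread evenly, load on the monitored machines themselves becomes the more likely suspect.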