Problem with unknown alerts

Post by **mikewazowski** » Fri Nov 23, 2012 10:21 am

Hi,

I have Nagios Core 3.3.1 with almost 300 monitored services, and we are having problems with UNKNOWN alerts. Following some recommendations for resolving this problem, I increased the timeout to 30 seconds, but I am still receiving alerts. When I start receiving UNKNOWN alerts, I generally also receive CRITICAL alerts.
If I run check_nrpe manually, I never get UNKNOWN, even at the moment I see UNKNOWN alerts on the dashboard. I have also checked the Nagios server and the remote servers, and the CPU is almost idle. Any idea what it could be?
Let me know if you need more info.
Thanks!!

Post by **jsmurphy** » Sun Nov 25, 2012 7:17 pm

Sounds to me like you are experiencing intermittent network connectivity, congestion or QoS issues. I'm 98.9% sure that the Nagios application is not your problem here.

It could also be (though much less likely) that you are running too many active checks at a time and the OS has some kind of limit on the number of active connections and is culling them. I'm assuming that the "Critical" messages you are receiving for CPU have a description like "connection timeout" or "connection refused" or something similar. You can probably fix that by adding the "-u" flag to your command which (depending on the plugin) should make it unknown on timeout instead of critical.
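
As a rough illustration of what that flag changes, here is a minimal Python sketch of the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and how a socket timeout would be classified with and without "-u". The function name `timeout_state` is hypothetical and only models the behavior; it is not part of check_nrpe.

```python
# Standard Nagios plugin return codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def timeout_state(unknown_on_timeout: bool) -> int:
    """Hypothetical model: exit code reported when the socket times out."""
    # With -u, a timeout is reported as UNKNOWN (3); without it, CRITICAL (2).
    return 3 if unknown_on_timeout else 2


for flag in (False, True):
    code = timeout_state(flag)
    print(f"-u set: {flag} -> exit {code} ({STATES[code]})")
```

Either way the underlying event is the same socket timeout; the flag only changes which state it is reported as.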

Work with your network team to see if you can determine one of the problems mentioned at the start of my post.

Post by **mikewazowski** » Thu Nov 29, 2012 1:58 pm

Hi Murphy, thanks for your reply. I have never detected connectivity problems. I use NRPE with these parameters: /$USER1$/check_nrpe -H $HOSTADDRESS$ -u -c $ARG1$ -t 30. We receive the UNKNOWN messages, and they almost always begin to appear when one server starts to show CRITICAL on some services. I have attached performance info from the Nagios server.

Thanks a lot!

Post by **jsmurphy** » Thu Nov 29, 2012 5:16 pm

What are the critical services you see? What are they checking and what is the problem description?

Post by **mikewazowski** » Fri Nov 30, 2012 1:39 pm

For example, I have a service monitoring port 800x, and every night the connection from the monitored server to the other server restarts. I have 4 or more services monitoring port 800x, and when the far-end server restarts, the connections drop and Nagios alerts with CRITICAL. This behavior is correct, but at the moment I start getting CRITICAL alerts, I also start receiving UNKNOWN alerts from services that are running without problems, like DISKGROUPS monitoring or Oracle services... But these services do not have any problems.
The message in the UNKNOWN alert is:
CHECK_NRPE: Socket timeout after 30 seconds.
I hope you can understand the problem.

Thanks again!

Post by **slansing** » Fri Nov 30, 2012 2:35 pm

The timeout is generally related to just that: a disruption in the connection. Are the checks that receive this error on the same connection as the one that restarts nightly? If so, it may just be that they run at around the same time as your last port check. That would line up with your port check going critical from the connection restart, and it would explain the delay between the connection going down and the socket timeout error appearing.
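
One way to check that timing theory is to line up the alert timestamps per host. Below is a minimal Python sketch, assuming you have already extracted alerts into (timestamp, host, service, state) tuples. The sample data, host names, function name, and 5-minute window are all made up for illustration.

```python
from datetime import datetime, timedelta


def unknowns_near_criticals(alerts, window=timedelta(minutes=5)):
    """Return UNKNOWN alerts within `window` of a CRITICAL on the same host."""
    criticals = [(t, h) for t, h, _, s in alerts if s == "CRITICAL"]
    return [
        (t, h, svc)
        for t, h, svc, s in alerts
        if s == "UNKNOWN"
        and any(h == ch and abs(t - ct) <= window for ct, ch in criticals)
    ]


# Hypothetical sample data, not taken from any real log.
sample = [
    (datetime(2012, 11, 30, 2, 0), "appserver1", "port_800x", "CRITICAL"),
    (datetime(2012, 11, 30, 2, 1), "appserver1", "DISKGROUPS", "UNKNOWN"),
    (datetime(2012, 11, 30, 14, 0), "appserver2", "oracle", "UNKNOWN"),
]

for t, host, svc in unknowns_near_criticals(sample):
    print(t, host, svc)
```

If most of the UNKNOWN alerts land in that list, they are probably a side effect of the same connection restart rather than separate problems.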

Post by **mikewazowski** » Mon Dec 10, 2012 10:05 am

The only thing that is restarted is a socks connection; the server does not go down, and all the other services continue to operate normally. The strange thing is that sometimes it gives UNKNOWN for other services unrelated to the one that is really down.

Thanks again and sorry for my English!!!

Post by **jsmurphy** » Mon Dec 10, 2012 4:58 pm

This seems fairly obvious to me. Socks is a network service; we're saying it's a network issue, and the issue occurs when you are restarting the socks service.

It's fairly common for certain network services (particularly those dealing in changes to routing, like socks) to cause problems with network connectivity when they are unavailable; at the very least, the network connection will need to be re-initialized. So if the network connection is down, Nagios can't reach the server... which means all services for that server (even those unrelated to network connectivity) are going to display... what?

UNKNOWN! If it can't reach the server it doesn't know what the state of those services is! Furthermore it's telling you why it doesn't know: the connection timed out.
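
To see that distinction from the Nagios server itself, a rough Python sketch like the one below tries a plain TCP connection and reports whether the port answered, refused, or timed out. The function name `probe` is hypothetical, and port 5666 is only the usual NRPE default; adjust the host, port, and timeout for your setup.

```python
import socket


def probe(host: str, port: int = 5666, timeout: float = 5.0) -> str:
    """Try a TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        # No answer within `timeout` seconds: the state of the host is unknown.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. no route to host or name resolution failure.
        return f"error: {exc}"
```

Running `probe` against the affected host during the nightly restart window and seeing "timeout" would point at the network path rather than at the services being checked.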

Nagios Support Forum
