Problem with unknown alerts

mikewazowski
Posts: 6
Joined: Fri Nov 23, 2012 10:01 am

Problem with unknown alerts

Post by mikewazowski »

Hi,

I have Nagios Core 3.3.1 with almost 300 monitored services, and we have some problems with UNKNOWN alerts. Following some recommendations for resolving this, I increased the timeout to 30 seconds, but I am still receiving alerts. When I start receiving UNKNOWN alerts, I generally also receive CRITICAL alerts.
If I run check_nrpe manually, I never get UNKNOWN, even at the moment I see UNKNOWN alerts on the dashboard. I have also checked the Nagios server and the remote servers, and the CPU is almost idle. Any idea what it could be?
Let me know if you need more info.
Thanks!!
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: Problem with unknown alerts

Post by jsmurphy »

Sounds to me like you are experiencing intermittent network connectivity, congestion or QoS issues. I'm 98.9% sure that the Nagios application is not your problem here.

It could also be (though much less likely) that you are running too many active checks at a time and the OS has some kind of limit on the number of active connections and is culling them. I'm assuming that the "Critical" messages you are receiving for CPU have a description like "connection timeout" or "connection refused" or something similar. You can probably fix that by adding the "-u" flag to your command which (depending on the plugin) should make it unknown on timeout instead of critical.

Work with your network team to see if you can determine one of the problems mentioned at the start of my post.
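
As a rough illustration of what the "-u" flag changes, here is a minimal Python sketch. It is not NRPE code: `run_check` and its `unknown_on_timeout` parameter are hypothetical names invented for this example, and the only facts it relies on are the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It runs a command with a time limit and decides which state a timeout maps to.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def run_check(argv, timeout_s, unknown_on_timeout=True):
    """Run a plugin command and return (state, output).

    Hypothetical helper: when the command exceeds timeout_s, report
    UNKNOWN if unknown_on_timeout is set (the idea behind "-u"),
    otherwise report CRITICAL.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        state = UNKNOWN if unknown_on_timeout else CRITICAL
        return state, f"Timeout after {timeout_s} seconds"
    # Exit codes outside 0-3 are treated as UNKNOWN in this sketch.
    state = proc.returncode if proc.returncode in STATE_NAMES else UNKNOWN
    return state, proc.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a slow remote check: a child process that just sleeps.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    state, output = run_check(slow, timeout_s=0.5)
    print(STATE_NAMES[state], "-", output)
```

The point is only that a timeout says nothing about the service itself, so which state it maps to is a policy choice rather than a measurement.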
mikewazowski
Posts: 6
Joined: Fri Nov 23, 2012 10:01 am

Re: Problem with unknown alerts

Post by mikewazowski »

Hi Murphy, thanks for your reply. I have never detected connectivity problems. I use NRPE with these parameters: /$USER1$/check_nrpe -H $HOSTADDRESS$ -u -c $ARG1$ -t 30. We receive the UNKNOWN messages, and they almost always begin to appear when one server starts to have CRITICAL states on some services. I have attached performance info from the Nagios server.

Thanks a lot!
Attachments
performance info nagios.JPG
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: Problem with unknown alerts

Post by jsmurphy »

What are the critical services you see? What are they checking and what is the problem description?
mikewazowski
Posts: 6
Joined: Fri Nov 23, 2012 10:01 am

Re: Problem with unknown alerts

Post by mikewazowski »

For example, I have a service monitoring port 800x, and every night the connection from the monitored server to the other server restarts. I have four or more services monitoring port 800x, and when the far-end server restarts, the connections drop and Nagios alerts with CRITICAL. This behaviour is correct, but at the moment I start to get CRITICAL alerts, I also start receiving UNKNOWN alerts from services that are running without problems, like DISKGROUPS monitoring or Oracle services. These services do not have any problems.
The info in the UNKNOWN is:
CHECK_NRPE: Socket timeout after 30 seconds.
I hope you can understand the problem.

Thanks again!
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Problem with unknown alerts

Post by slansing »

The timeout is generally related to just that: a disruption in the connection. Are the checks that receive this error on the same connection as the one that restarts nightly? If so, it may just be that they run at around the same time as your port checks. That would line up with your port checks going critical from the connection restart, and it would explain the delay between the connection going down and the socket timeout error appearing.
mikewazowski
Posts: 6
Joined: Fri Nov 23, 2012 10:01 am

Re: Problem with unknown alerts

Post by mikewazowski »

The only thing that is restarted is a socks connection; the server does not go down, and all the other services continue to operate normally. The strange thing is that Nagios sometimes reports UNKNOWN for other services unrelated to the one that really is down.

Thanks again and sorry for my English!!!
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: Problem with unknown alerts

Post by jsmurphy »

This seems fairly obvious to me. Socks is a network service, we're saying it's a network issue and the issue occurs when you are restarting the socks service.

It's fairly common for certain network services (particularly those dealing in changes to routing like socks) to cause problems with network connectivity when they are unavailable; at the very least it will need to re-initialize the network connection. So if the network connection is down, Nagios can't reach the server... which means all services for that server (even those unrelated to network connectivity) are going to display... what?

UNKNOWN! If it can't reach the server it doesn't know what the state of those services is! Furthermore it's telling you why it doesn't know: the connection timed out.
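
One way to check this correlation against real data is to look for hosts where a CRITICAL alert and a socket-timeout UNKNOWN land in the same time window. The Python sketch below assumes the usual Nagios Core log line layout (`[epoch] SERVICE ALERT: host;service;STATE;STATE_TYPE;attempt;output`); the function names, the 300-second window, and the sample host names in the usage note are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed log layout:
# [epoch] SERVICE ALERT: host;service;STATE;STATE_TYPE;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([^;]+);([^;]+);(\d+);(.*)$"
)


def alerts_by_host(lines, window_s=300):
    """Group SERVICE ALERT entries into per-host buckets of window_s seconds."""
    buckets = defaultdict(list)
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a service alert
        ts, host, service, state, _state_type, _attempt, output = match.groups()
        bucket_start = int(ts) // window_s * window_s
        buckets[(host, bucket_start)].append((service, state, output))
    return buckets


def correlated_timeouts(lines, window_s=300):
    """Return (host, bucket_start) keys where a CRITICAL alert and a
    socket-timeout UNKNOWN alert fall in the same window."""
    hits = []
    for key, entries in sorted(alerts_by_host(lines, window_s).items()):
        has_critical = any(state == "CRITICAL" for _, state, _ in entries)
        has_timeout = any(
            state == "UNKNOWN" and "Socket timeout" in output
            for _, state, output in entries
        )
        if has_critical and has_timeout:
            hits.append(key)
    return hits
```

Feeding it the lines of the Nagios log file (for example `correlated_timeouts(open(path))`, where `path` is wherever your install keeps nagios.log) gives a list of host and window pairs to compare against the nightly restart time. If the UNKNOWNs always share a host and a window with the port 800x CRITICALs, that supports the connectivity explanation.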