Starting a week or so ago, we began getting a lot of "Unknown" errors from Nagios. They all say something like:
check_wsc UNKNOWN: Problem getting service response message, code=500, message=read failed: Connection reset by peer
We apparently have Nagios set up so that our Windows Domain Controller handles the actual SNMP polling of the servers, since it can see them all, and Nagios pulls its info from the DC.
We noticed that if someone remotes into the DC and logs on, the Unknown errors all stop. Then, as soon as that person logs off and no one is actively remoted into the machine, the Unknown errors come back. It's as if remoting into the machine wakes it up from a half-asleep state, so the communication starts working better.
We've had this up and running for years, and this only started happening a week or so ago, which happens to coincide with some Windows patching we did. However, I don't see anything in the patch descriptions that suggests they would have this effect on what we're doing.
Has anyone seen anything like this before, and do you have any idea what might be causing it and how to fix it? Normally a check gets the Unknown error once and the next check works, but sometimes it fails 2 or 3 times in a row, which triggers an alert email. We'd obviously rather not have people woken up at night for no reason.