For example:
1. A server starts sending 'socket timeout after 30 seconds' alerts (we'll assume there is a valid issue causing this) and continues to send one every 10 minutes.
2. Other servers (usually ones with lots of checks, and usually at night) send the same message, but when I log in to the console and run 'schedule check immediately', the alert clears right away.
Scenario 1 indicates a valid issue, but the error message only conveys that something is wrong with the NSClient++ service.
- This is currently happening to a server, and since other issues are appearing alongside it (e.g. no RDP), it is probably legitimate and not just a service being overloaded.
- The NRPE socket timeout is happening to all services on this host.
Scenario 2 appears to indicate the same issue, but when investigated, nothing is wrong and the alert clears as soon as I force a check.
- This happened to 3 servers (ones that have lots of checks) over a weekend, and only by increasing the vCPU count on the Nagios XI server was I able to clear up the problem.
- The NRPE socket timeout is happening to only some services on these hosts, and the affected services change from one occurrence to the next; it is not the same ones every time.
- It has also happened to a few URL checks (only one check per host), but those usually clear up quickly.
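One clue that may separate scenario 2 from a real outage is the Nagios server's own check latency: when the scheduler falls behind, latency climbs across many services at once, which matches the "lots of checks, cleared by adding vCPUs" pattern above. A minimal sketch using the `nagiostats` utility that ships with Nagios Core (and underlies XI); the exact MRTG variable name is an assumption from the `nagiostats` documentation, so verify it with `nagiostats --help` on your install:

```
# Average latency (seconds) of active service checks.
# High or rising values across the board suggest the Nagios
# server itself is overloaded, not the monitored hosts.
nagiostats --mrtg --data=AVGACTSVCLAT
```

Graphing that value (or alerting on it) would at least tell you whether a burst of socket-timeout alerts coincides with the scheduler running behind.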
So, in both scenarios the error message is the same, but the root problem is not. Does that make sense?
I need to know how to get Nagios to differentiate between itself getting overloaded and a server legitimately being down/broken, if for no other reason than to reduce the amount of spam Nagios sends our NOC team.
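For what it's worth, one mitigation I'd consider while waiting for a proper answer is giving slow NRPE responses more headroom, so that a briefly overloaded poller doesn't fire timeouts at the NOC. This is only a sketch; the directive values are assumptions to be tuned, not recommendations, and paths/macros follow a stock Core-style layout:

```
# nagios.cfg -- global knobs (values here are illustrative)
service_check_timeout=60      ; allow checks longer than the default before Nagios kills them
max_concurrent_checks=0       ; 0 = unlimited; a bounded value sized to CPU count can smooth load spikes

# commands.cfg -- pass an explicit timeout to check_nrpe itself
define command {
    command_name  check_nrpe_long
    command_line  $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c $ARG1$
}
```

Raising `-t` on `check_nrpe` above the 30 seconds in the alert text won't fix a genuinely dead host (scenario 1 will still time out and alert), but it should suppress the transient scenario 2 noise caused by the Nagios server being momentarily starved.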