NRPE agent failed on multiple Solaris servers
Posted: Wed Oct 02, 2019 6:28 am
Following up on my previous post (https://support.nagios.com/forum/viewto ... 16&t=54357): the issue occurred again this morning, with the same version of NRPE as before. This time it affected only some of the zones, spread across 3 different pieces of hardware. The Solaris OS is 5.11 11.4.10.3.0 sun4v sparc sun4v.
We ran the requested command, and we see:
Code: Select all
$ svcs -xv svc:/network/nagios/nrpe:default
svc:/network/nagios/nrpe:default (NRPE daemon)
State: maintenance since October 2, 2019 at 3:02:49 AM EDT
Reason: Start method failed repeatedly, last died on Killed (9).
See: http://support.oracle.com/msg/SMF-8000-KS
See: http://www.nagios.org
See: /var/svc/log/network-nagios-nrpe:default.log
Impact: This service is not running.
Looking at the service log file listed above, I see:
Code: Select all
$ cat /var/svc/log/network-nagios-nrpe:default.log
[ 2019 Oct 2 03:01:18 Stopping because all processes in service exited. ]
[ 2019 Oct 2 03:01:21 Executing stop method (:kill). ]
[ 2019 Oct 2 03:02:42 Executing start method ("/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d"). ]
[ 2019 Oct 2 03:02:48 Method or service exit timed out. Killing contract 1158050. ]
[ 2019 Oct 2 03:02:49 Method "start" failed due to signal KILL. ]
Checking /var/adm/messages, I see the same logs as last time:
Code: Select all
Oct 2 03:02:02 SERVER svc.startd[18535]: [ID 462725 daemon.warning] svc:/network/nagios/nrpe:default: Couldn't fork to execute method /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d: Resource temporarily unavailable
Oct 2 03:02:32 SERVER svc.startd[18535]: [ID 462725 daemon.warning] svc:/network/nagios/nrpe:default: Couldn't fork to execute method /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d: Resource temporarily unavailable
Oct 2 03:02:48 SERVER svc.startd[18535]: [ID 737100 daemon.warning] svc:/network/nagios/nrpe:default: Method or service exit timed out. Killing contract 1158050.
Oct 2 03:02:49 SERVER svc.startd[18535]: [ID 636263 daemon.warning] svc:/network/nagios/nrpe:default: Method "/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d" failed due to signal KILL.
Oct 2 03:02:49 SERVER svc.startd[18535]: [ID 748625 daemon.error] network/nagios/nrpe:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Oct 2 03:02:51 SERVER fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: Defect, VER: 1, SEVERITY: Major#012EVENT-TIME: Wed Oct 2 03:02:50 EDT 2019#012PLATFORM: unknown, CSN: unknown, HOSTNAME: SERVER #012SOURCE: software-diagnosis, REV: 0.2#012EVENT-ID: f5051662-41bb-492e-bc0c-835d7cd0b805#012DESC: Service svc:/network/nagios/nrpe:default failed - a method is failing in a retryable manner but too often.#012AUTO-RESPONSE: The service has been placed into the maintenance state.#012IMPACT: svc:/network/nagios/nrpe:default is unavailable.#012REC-ACTION: Run 'svcs -xv svc:/network/nagios/nrpe:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at http://support.oracle.com/msg/SMF-8000-YX for the latest service procedures and policies regarding this diagnosis.
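Since the failure hit only some zones, it may help to tally the "Couldn't fork" warnings per service in each zone's /var/adm/messages and compare the counts. The following is only a rough sketch: the function name `count_fork_failures` is made up for illustration, and it assumes a POSIX awk (on Solaris that may mean nawk or /usr/xpg4/bin/awk rather than the default awk).

```shell
#!/bin/sh
# Hypothetical helper (not part of NRPE or SMF): count svc.startd
# "Couldn't fork" warnings per SMF service in a syslog-style file.
# Example call: count_fork_failures /var/adm/messages
count_fork_failures() {
    # "Couldn.t" matches the apostrophe without breaking the shell quoting.
    awk '/Couldn.t fork to execute method/ {
        # The service FMRI is the first field that starts with "svc:/".
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^svc:\//) {
                sub(/:$/, "", $i)   # drop the colon that ends the FMRI in the log line
                count[$i]++
                break
            }
        }
    }
    END { for (s in count) print count[s], s }' "$1"
}
```

Each output line is a count followed by the service FMRI. A zone that shows many of these warnings around the time of the failure was likely short on processes, LWPs, or memory at that moment, which matches the "Resource temporarily unavailable" error in the log.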