
NRPE agent failed on multiple Solaris servers

Posted: Wed Oct 02, 2019 6:28 am
by hbouma
Referencing my previous post https://support.nagios.com/forum/viewto ... 16&t=54357, the issue occurred again this morning. Same version of NRPE as the previous post. This time, it occurred only on some of the zones across 3 different pieces of hardware. Solaris OS is 5.11 11.4.10.3.0 sun4v sparc sun4v

We ran the requested command, and we see:

Code: Select all

$ svcs -xv svc:/network/nagios/nrpe:default
svc:/network/nagios/nrpe:default (NRPE daemon)
 State: maintenance since October  2, 2019 at  3:02:49 AM EDT
Reason: Start method failed repeatedly, last died on Killed (9).
   See: http://support.oracle.com/msg/SMF-8000-KS
   See: http://www.nagios.org
   See: /var/svc/log/network-nagios-nrpe:default.log
Impact: This service is not running.
Looking at the service log file listed above, I see:

Code: Select all

$ cat /var/svc/log/network-nagios-nrpe:default.log
[ 2019 Oct  2 03:01:18 Stopping because all processes in service exited. ]
[ 2019 Oct  2 03:01:21 Executing stop method (:kill). ]
[ 2019 Oct  2 03:02:42 Executing start method ("/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d"). ]
[ 2019 Oct  2 03:02:48 Method or service exit timed out.  Killing contract 1158050. ]
[ 2019 Oct  2 03:02:49 Method "start" failed due to signal KILL. ]
Checking the /var/adm/messages, I see the same logs as last time:

Code: Select all

Oct  2 03:02:02 SERVER svc.startd[18535]: [ID 462725 daemon.warning] svc:/network/nagios/nrpe:default: Couldn't fork to execute method /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d: Resource temporarily unavailable
Oct  2 03:02:32 SERVER svc.startd[18535]: [ID 462725 daemon.warning] svc:/network/nagios/nrpe:default: Couldn't fork to execute method /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d: Resource temporarily unavailable
Oct  2 03:02:48 SERVER svc.startd[18535]: [ID 737100 daemon.warning] svc:/network/nagios/nrpe:default: Method or service exit timed out.  Killing contract 1158050.
Oct  2 03:02:49 SERVER svc.startd[18535]: [ID 636263 daemon.warning] svc:/network/nagios/nrpe:default: Method "/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d" failed due to signal KILL.
Oct  2 03:02:49 SERVER svc.startd[18535]: [ID 748625 daemon.error] network/nagios/nrpe:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Oct  2 03:02:51 SERVER fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: Defect, VER: 1, SEVERITY: Major#012EVENT-TIME: Wed Oct  2 03:02:50 EDT 2019#012PLATFORM: unknown, CSN: unknown, HOSTNAME: SERVER #012SOURCE: software-diagnosis, REV: 0.2#012EVENT-ID: f5051662-41bb-492e-bc0c-835d7cd0b805#012DESC: Service svc:/network/nagios/nrpe:default failed - a method is failing in a retryable manner but too often.#012AUTO-RESPONSE: The service has been placed into the maintenance state.#012IMPACT: svc:/network/nagios/nrpe:default is unavailable.#012REC-ACTION: Run 'svcs -xv svc:/network/nagios/nrpe:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at http://support.oracle.com/msg/SMF-8000-YX for the latest service procedures and policies regarding this diagnosis.

Re: NRPE agent failed on multiple Solaris servers

Posted: Wed Oct 02, 2019 1:45 pm
by tgriep
First, run this command on the Solaris servers to see if the NRPE agent is running:

Code: Select all

ps -ef |grep nrpe
If you see something similar to this then the NRPE agent is running.

Code: Select all

 nagios   600     1   0   Mar 13 ?           0:00 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
Do you see it running?

If it is, kill the process and try to start the NRPE agent again.
If it still does not start up, run the following on one of the Solaris servers to start the daemon by hand:

Code: Select all

/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
Then run this to see if the NRPE agent is running, and post here whether it is or not.

Code: Select all

ps -ef |grep nrpe

Re: NRPE agent failed on multiple Solaris servers

Posted: Wed Oct 02, 2019 2:53 pm
by hbouma
At this time, I cannot get you the output. This happened on our production server and I had to get the monitoring up and running as quickly as possible.

Re: NRPE agent failed on multiple Solaris servers

Posted: Wed Oct 02, 2019 3:20 pm
by scottwilkerson
The fact that you were getting "Resource temporarily unavailable" while nrpe was trying to fork its process leads me to believe that you were temporarily out of some type of resource: either memory, or possibly an open file limit on the system.
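A minimal sketch of how those limits could be checked from inside an affected zone while the problem is occurring. It assumes a POSIX shell and the standard Solaris tools `zonename`, `prctl` and `swap`; each Solaris-specific command is guarded so it is skipped where it does not exist. The resource controls `zone.max-lwps` and `zone.max-processes` are only candidates worth inspecting, not a confirmed cause, and the `have` and `proc_count` helper names are made up for this example.

```shell
#!/bin/sh
# Sketch: snapshot the limits that commonly make fork() fail with EAGAIN.
# Nothing here is confirmed as the cause in this thread.

# Hypothetical helper: true if the named command is on the PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Hypothetical helper: count processes visible in this zone
# (the ps header line is subtracted).
proc_count() {
    n=$(ps -e 2>/dev/null | wc -l | tr -d ' ')
    echo $((n - 1))
}

echo "processes visible: $(proc_count)"
echo "open file limit (soft): $(ulimit -n)"

if have zonename && have prctl; then
    zone=$(zonename)
    # Zone-wide caps on LWPs and processes, if any are set.
    prctl -n zone.max-lwps -i zone "$zone"
    prctl -n zone.max-processes -i zone "$zone"
fi

# Swap summary; fork() can fail if swap cannot be reserved.
if have swap; then
    swap -s
fi
```

Comparing the output from a zone that failed with one that did not might show which limit, if any, was being approached.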

Re: NRPE agent failed on multiple Solaris servers

Posted: Wed Oct 02, 2019 3:30 pm
by tgriep
When a connection comes in to the NRPE agent, the agent forks a new copy of itself to run the requested command. On the Solaris servers it could not fork, so it failed.
The bad news is that the details you posted do not show why that happened.
Were there any other messages in the /var/adm/messages log file before the NRPE agent errors that could help figure out what happened?

Re: NRPE agent failed on multiple Solaris servers

Posted: Thu Oct 03, 2019 7:36 am
by hbouma
I did miss one line at the top of the logs, but nothing else is listed that could explain the problem.

Code: Select all

Oct  2 03:01:17 SERVER nrpe[144]: [ID 702911 daemon.error] fork() failed with error 11, bailing out...
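For reference, error 11 is EAGAIN on Solaris, the same "Resource temporarily unavailable" condition that svc.startd reported when it could not fork the start method. A small sketch that decodes the two errno values fork() most commonly returns; the function name `fork_errno_name` is made up for this example.

```shell
#!/bin/sh
# Sketch: translate the errno values fork() most often returns.
# 11 and 12 match <sys/errno.h> on Solaris (and on Linux).
fork_errno_name() {
    case "$1" in
        11) echo "EAGAIN: Resource temporarily unavailable" ;;
        12) echo "ENOMEM: insufficient memory or swap" ;;
        *)  echo "errno $1: not covered by this sketch" ;;
    esac
}

fork_errno_name 11
```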

Re: NRPE agent failed on multiple Solaris servers

Posted: Thu Oct 03, 2019 8:54 am
by tgriep
Thanks for looking in the log file. There is not much to go on.
If you can post the full log file, or at least the 10 minutes of it before the failure, there may be something else we can look at.

Re: NRPE agent failed on multiple Solaris servers

Posted: Thu Oct 03, 2019 9:02 am
by hbouma
PM sent with log file.

Re: NRPE agent failed on multiple Solaris servers

Posted: Thu Oct 03, 2019 12:06 pm
by tgriep
I received the log. Yeah, nothing to go on.

You could edit the nrpe.cfg file and enable debugging by setting the following option to 1.

Code: Select all

debug=1
Save it and restart NRPE to load the change and then see if it logs anything if it happens again.
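A minimal sketch of that change, assuming the config path and SMF service name shown earlier in this thread. The `enable_debug` function name and the `.bak` backup file are choices made for this example only. It writes through a temp file because the stock Solaris `sed` has no in-place option, and it uses `svcadm clear` instead of `svcadm restart` when the instance is sitting in maintenance.

```shell
#!/bin/sh
# Sketch: turn on NRPE debug logging, then restart the service through SMF.
CFG=${CFG:-/usr/local/nagios/etc/nrpe.cfg}
FMRI=svc:/network/nagios/nrpe:default

# Rewrite an existing "debug=" line to "debug=1", or append one if absent.
# The original file is kept as "<file>.bak" (hypothetical naming).
enable_debug() {
    cfg=$1
    tmp="$cfg.tmp.$$"
    if grep '^debug=' "$cfg" >/dev/null 2>&1; then
        sed 's/^debug=.*/debug=1/' "$cfg" > "$tmp"
    else
        { cat "$cfg"; echo 'debug=1'; } > "$tmp"
    fi
    # Overwrite in place so ownership and permissions are preserved.
    cp "$cfg" "$cfg.bak" && cat "$tmp" > "$cfg" && rm -f "$tmp"
}

if [ -w "$CFG" ]; then
    enable_debug "$CFG"
fi

if command -v svcadm >/dev/null 2>&1; then
    state=$(svcs -H -o state "$FMRI")
    if [ "$state" = "maintenance" ]; then
        # Clearing maintenance lets SMF try the start method again.
        svcadm clear "$FMRI"
    else
        svcadm restart "$FMRI"
    fi
fi
```

Debug output goes to syslog, so it should appear alongside the other nrpe entries the next time the failure occurs.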

Re: NRPE agent failed on multiple Solaris servers

Posted: Thu Oct 03, 2019 1:09 pm
by hbouma
At this point, I think we can just close this. We are in the process of migrating off of this agent and OS, so my management does not see the extra work and load on the server as worth it.