NRPE agent failed on multiple Solaris servers

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

NRPE agent failed on multiple Solaris servers

Post by hbouma »

Referencing my previous post https://support.nagios.com/forum/viewto ... 16&t=54357, the issue occurred again this morning. Same version of NRPE as the previous post. This time, it occurred only on some of the zones across 3 different pieces of hardware. Solaris OS is 5.11 11.4.10.3.0 sun4v sparc sun4v

We ran the requested command, and we see:

Code: Select all

$ svcs -xv svc:/network/nagios/nrpe:default
svc:/network/nagios/nrpe:default (NRPE daemon)
 State: maintenance since October  2, 2019 at  3:02:49 AM EDT
Reason: Start method failed repeatedly, last died on Killed (9).
   See: http://support.oracle.com/msg/SMF-8000-KS
   See: http://www.nagios.org
   See: /var/svc/log/network-nagios-nrpe:default.log
Impact: This service is not running.
Looking at the service log file listed above, I see:

Code: Select all

$ cat /var/svc/log/network-nagios-nrpe:default.log
[ 2019 Oct  2 03:01:18 Stopping because all processes in service exited. ]
[ 2019 Oct  2 03:01:21 Executing stop method (:kill). ]
[ 2019 Oct  2 03:02:42 Executing start method ("/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d"). ]
[ 2019 Oct  2 03:02:48 Method or service exit timed out.  Killing contract 1158050. ]
[ 2019 Oct  2 03:02:49 Method "start" failed due to signal KILL. ]
Checking the /var/adm/messages, I see the same logs as last time:

Code: Select all

Oct  2 03:02:02 SERVER svc.startd[18535]: [ID 462725 daemon.warning] svc:/network/nagios/nrpe:default: Couldn't fork to execute method /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d: Resource temporarily unavailable
Oct  2 03:02:32 SERVER svc.startd[18535]: [ID 462725 daemon.warning] svc:/network/nagios/nrpe:default: Couldn't fork to execute method /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d: Resource temporarily unavailable
Oct  2 03:02:48 SERVER svc.startd[18535]: [ID 737100 daemon.warning] svc:/network/nagios/nrpe:default: Method or service exit timed out.  Killing contract 1158050.
Oct  2 03:02:49 SERVER svc.startd[18535]: [ID 636263 daemon.warning] svc:/network/nagios/nrpe:default: Method "/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d" failed due to signal KILL.
Oct  2 03:02:49 SERVER svc.startd[18535]: [ID 748625 daemon.error] network/nagios/nrpe:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Oct  2 03:02:51 SERVER fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: Defect, VER: 1, SEVERITY: Major#012EVENT-TIME: Wed Oct  2 03:02:50 EDT 2019#012PLATFORM: unknown, CSN: unknown, HOSTNAME: SERVER #012SOURCE: software-diagnosis, REV: 0.2#012EVENT-ID: f5051662-41bb-492e-bc0c-835d7cd0b805#012DESC: Service svc:/network/nagios/nrpe:default failed - a method is failing in a retryable manner but too often.#012AUTO-RESPONSE: The service has been placed into the maintenance state.#012IMPACT: svc:/network/nagios/nrpe:default is unavailable.#012REC-ACTION: Run 'svcs -xv svc:/network/nagios/nrpe:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at http://support.oracle.com/msg/SMF-8000-YX for the latest service procedures and policies regarding this diagnosis.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: NRPE agent failed on multiple Solaris servers

Post by tgriep »

First, run this command on the Solaris servers to see if the NRPE agent is running

Code: Select all

ps -ef |grep nrpe
If you see something similar to this then the NRPE agent is running.

Code: Select all

 nagios   600     1   0   Mar 13 ?           0:00 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
Do you see it running?

Kill the process and try to start the NRPE agent.
If it still does not start up, run the following on a Solaris server

Code: Select all

/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
Then run this to see whether the NRPE agent is running, and post the result here either way.

Code: Select all

ps -ef |grep nrpe
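
Since the `svcs -xv` output earlier in the thread shows the service sitting in the SMF maintenance state, the recovery step could also go through SMF rather than a manual start. The following is only a sketch under that assumption: the function names `action_for_state` and `recover_nrpe` are made up for illustration, and the state-to-action mapping is a simplification, not an official procedure.

```shell
#!/bin/sh
# FMRI taken from the svcs output posted in this thread.
FMRI=svc:/network/nagios/nrpe:default

# Map an SMF state to the svcadm subcommand that would normally bring the
# service back. Anything unexpected is left for a human to inspect.
action_for_state() {
    case $1 in
        maintenance) echo clear ;;
        disabled)    echo enable ;;
        online)      echo none ;;
        *)           echo inspect ;;
    esac
}

# Only meaningful on a host with SMF; elsewhere it reports and returns.
recover_nrpe() {
    if ! command -v svcs >/dev/null 2>&1; then
        echo "svcs not found; this does not look like an SMF host"
        return 0
    fi
    state=$(svcs -H -o state "$FMRI")
    action=$(action_for_state "$state")
    echo "state=$state action=$action"
    case $action in
        clear|enable) svcadm "$action" "$FMRI" ;;
    esac
}
```

Calling `recover_nrpe` as a user with rights to manage the service would clear the maintenance state so that SMF retries the start method; `ps -ef |grep nrpe` can then confirm whether the daemon came up.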
Be sure to check out our Knowledgebase for helpful articles and solutions!
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: NRPE agent failed on multiple Solaris servers

Post by hbouma »

At this time, I cannot get you the output. This happened on our production server and I had to get the monitoring up and running as quickly as possible.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: NRPE agent failed on multiple Solaris servers

Post by scottwilkerson »

The fact that you were getting "Resource temporarily unavailable" while nrpe was trying to fork its process leads me to believe that you were temporarily out of some type of resource: either memory, or possibly an open file limit on the system.
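
To narrow down which resource ran out, it may help to record limits and usage periodically so there is something to compare against the next time fork() fails. This is a minimal sketch, not a diagnosis tool: the function name `snapshot_resources` and the output path are hypothetical, and the `swap -s` and `prctl` calls are Solaris tools that are only run where they are installed.

```shell
#!/bin/sh
# Append a timestamped snapshot of limits and usage to the given file.
snapshot_resources() {
    out=$1
    {
        echo "== $(date) =="
        echo "-- ulimit -a (limits for this shell) --"
        ulimit -a
        echo "-- process count --"
        ps -e | wc -l
        # Solaris-only tools; skipped where they are not available.
        if command -v swap >/dev/null 2>&1; then
            echo "-- swap -s --"
            swap -s
        fi
        if command -v prctl >/dev/null 2>&1 &&
           command -v zonename >/dev/null 2>&1; then
            echo "-- zone.max-lwps for this zone --"
            prctl -n zone.max-lwps -i zone "$(zonename)"
        fi
    } >>"$out" 2>&1
}
```

Run from cron every few minutes, for example as `snapshot_resources /var/tmp/nrpe-resources.log` (path is only an example), this would leave a record covering the 03:00 window in which the failures were logged.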
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: NRPE agent failed on multiple Solaris servers

Post by tgriep »

When a connection comes in to the NRPE agent, the agent forks a new copy of itself to run the requested command. On the Solaris servers it could not fork, so it failed.
The bad news is that the details you posted do not include a detailed error explaining why that happened.
Were there any other messages in the /var/adm/messages log file before the NRPE agent errors that could help figure out what happened?
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: NRPE agent failed on multiple Solaris servers

Post by hbouma »

I did miss one line at the top of the logs, but nothing else is listed that could explain the problem.

Code: Select all

Oct  2 03:01:17 SERVER nrpe[144]: [ID 702911 daemon.error] fork() failed with error 11, bailing out...
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: NRPE agent failed on multiple Solaris servers

Post by tgriep »

Thanks for looking in the log file. Nothing much to go on.
If you can post the full log file, or at least the 10 minutes of it before the failure, there may be something else we can look at.
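
One way to pull out just that window from /var/adm/messages is sketched below. It assumes the syslog line layout shown earlier in the thread (month, day, HH:MM:SS), does not handle a window that crosses midnight, and uses a hypothetical function name `log_window`. On Solaris the default `awk` may be the legacy version without `-v` or user-defined functions, so the sketch lets you set `AWK=nawk` (an assumed override, not part of any tool).

```shell
#!/bin/sh
# Print lines from FILE for month MON and day DAY whose timestamp falls in
# the MINS minutes up to and including END (HH:MM:SS).
log_window() {
    file=$1 mon=$2 day=$3 end=$4 mins=$5
    "${AWK:-awk}" -v mon="$mon" -v day="$day" -v end="$end" -v mins="$mins" '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,   p) {
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        BEGIN { hi = secs(end); lo = hi - mins * 60 }
        $1 == mon && $2 + 0 == day + 0 {
            s = secs($3)
            if (s >= lo && s <= hi) print
        }
    ' "$file"
}
```

For the failure in this thread, `log_window /var/adm/messages Oct 2 03:02:49 10` would select everything logged on October 2 between 02:52:49 and 03:02:49.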
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: NRPE agent failed on multiple Solaris servers

Post by hbouma »

PM sent with log file.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: NRPE agent failed on multiple Solaris servers

Post by tgriep »

I received the log. Yeah, nothing to go on.

You could edit the nrpe.cfg file and enable debugging by setting the following option to 1.

Code: Select all

debug=1
Save the file and restart NRPE to load the change, then see whether it logs anything if the failure happens again.
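
As a rough sketch of that change, assuming the config path used in the start method earlier in the thread: the helper below sets `debug=1` and keeps a backup copy. The function name `enable_nrpe_debug` is hypothetical, and the `svcadm` line in the usage note applies only to hosts where NRPE runs under SMF as shown above.

```shell
#!/bin/sh
# Set debug=1 in an NRPE config file, keeping a backup as FILE.bak.
# Replaces an existing "debug=" line, or appends one if none is present.
enable_nrpe_debug() {
    cfg=$1
    cp "$cfg" "$cfg.bak" || return 1
    if grep '^debug=' "$cfg.bak" >/dev/null; then
        # Rewrite from the backup so the original file keeps its permissions.
        sed 's/^debug=.*/debug=1/' "$cfg.bak" >"$cfg"
    else
        echo 'debug=1' >>"$cfg"
    fi
}
```

Usage would be `enable_nrpe_debug /usr/local/nagios/etc/nrpe.cfg` followed by `svcadm restart svc:/network/nagios/nrpe:default`. Debug logging adds some load, so it is probably worth setting the option back to 0 once the failure has been captured.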
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: NRPE agent failed on multiple Solaris servers

Post by hbouma »

At this point, I think we can just close this. We are in the process of migrating off of this agent and OS, so my management does not see the extra work and load on the server as worth it.
Locked