NRPE agent failed on multiple Solaris servers

hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

NRPE agent failed on multiple Solaris servers

Post by hbouma »

We are running NRPE version 3.2.1 on hundreds of SPARC Solaris LDOMs (5.11 11.4.7.5.0 sun4v sparc sun4v), grouped across different sets of hardware. On one set of hardware, all 7 NRPE agents (each on a different LDOM) failed at 5:52:40 AM on Sunday with the errors below.

Researching what "fork() failed with error 12, bailing out" means turns up several possible explanations. Can anyone tell me what may have caused this error, or point me to where I could find this information?

Code:

Jun 16 05:52:40 SERVERNAME nrpe[17986]: [ID 702911 daemon.error] fork() failed with error 12, bailing out...
Jun 16 05:52:54 SERVERNAME nrpe[19611]: [ID 702911 daemon.notice] Starting up daemon
Jun 16 05:52:54 SERVERNAME nrpe[19611]: [ID 702911 daemon.notice] Warning: Daemon is configured to accept command arguments from clients!
Jun 16 05:53:54 SERVERNAME nrpe[19611]: [ID 702911 daemon.error] fork() failed with error 12, bailing out...
Jun 16 05:53:54 SERVERNAME nrpe[19672]: [ID 702911 daemon.notice] Starting up daemon
Jun 16 05:53:54 SERVERNAME nrpe[19672]: [ID 702911 daemon.notice] Warning: Daemon is configured to accept command arguments from clients!
Jun 16 05:54:40 SERVERNAME nrpe[19672]: [ID 702911 daemon.error] fork() failed with error 12, bailing out...
Jun 16 05:54:40 SERVERNAME nrpe[21421]: [ID 702911 daemon.notice] Starting up daemon
Jun 16 05:54:40 SERVERNAME nrpe[21421]: [ID 702911 daemon.notice] Warning: Daemon is configured to accept command arguments from clients!
Jun 16 05:56:53 SERVERNAME nrpe[21421]: [ID 702911 daemon.error] fork() failed with error 12, bailing out...
Jun 16 05:56:53 SERVERNAME nrpe[24766]: [ID 702911 daemon.error] Error: (!log_opts) Could not complete SSL handshake with 10.201.252.16: 1
Jun 16 05:56:53 SERVERNAME nrpe[24902]: [ID 702911 daemon.notice] Starting up daemon
Jun 16 05:56:53 SERVERNAME nrpe[24902]: [ID 702911 daemon.notice] Warning: Daemon is configured to accept command arguments from clients!
Jun 16 05:58:44 SERVERNAME nrpe[24902]: [ID 702911 daemon.error] fork() failed with error 12, bailing out...
Jun 16 05:58:45 SERVERNAME svc.startd[19762]: [ID 652011 daemon.warning] svc:/system/sstore:default: Method "/lib/svc/method/svc-sstore start" failed with exit status 1.
Jun 16 05:58:46 SERVERNAME svc.startd[19762]: [ID 748625 daemon.error] network/nagios/nrpe:default failed repeatedly: transitioned to maintenance (see 'svcs -xv' for details)
Jun 16 05:58:47 SERVERNAME fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: Defect, VER: 1, SEVERITY: Major
Jun 16 05:58:47 SERVERNAME EVENT-TIME: Sun Jun 16 05:58:47 EDT 2019
Jun 16 05:58:47 SERVERNAME PLATFORM: unknown, CSN: unknown, HOSTNAME: SERVERNAME
Jun 16 05:58:47 SERVERNAME SOURCE: software-diagnosis, REV: 0.2
Jun 16 05:58:47 SERVERNAME EVENT-ID: 27556f62-faf6-4dfa-b825-c6ae8b97b76f
Jun 16 05:58:47 SERVERNAME DESC: Service svc:/network/nagios/nrpe:default failed - the instance is restarting too quickly.
Jun 16 05:58:47 SERVERNAME AUTO-RESPONSE: The service has been placed into the maintenance state.
Jun 16 05:58:47 SERVERNAME IMPACT: svc:/network/nagios/nrpe:default is unavailable.
Jun 16 05:58:47 SERVERNAME REC-ACTION: Run 'svcs -xv svc:/network/nagios/nrpe:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at http://support.oracle.com/msg/SMF-8000-YX for the latest service procedures and policies regarding this diagnosis.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: NRPE agent failed on multiple Solaris servers

Post by tgriep »

If you run this command on one of the servers, what errors does it display?
svcs -xv svc:/network/nagios/nrpe:default

Were there any network outages at that time, or did some script or application run on the servers then that could have caused the issue?
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: NRPE agent failed on multiple Solaris servers

Post by hbouma »

There was no known network outage or application issues.

Our Server Team cleared the error and restarted the service before we could grab any info about the failure other than what is pasted from the logs.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: NRPE agent failed on multiple Solaris servers

Post by tgriep »

It's unfortunate that you could not get any detailed logs.

When a new connection comes in to the NRPE agent, the agent forks a new copy of itself to handle it.
If the system could not do that because memory was full, it would generate that error.
Alternatively, the agent may not have been releasing its forked copies, which would also cause that error; that kind of looks like what happened here.
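
For reference, error 12 from fork() corresponds to ENOMEM on Solaris and Linux. The sketch below is only an illustration in Python, not NRPE's actual source: it decodes the errno value and shows the general fork-per-connection pattern, including reaping finished children so they do not pile up. The function names (`describe_errno`, `handle_connection_in_child`, `reap_children`) are hypothetical.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and the OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def handle_connection_in_child(handler) -> int:
    """Fork one child per connection and return the child's PID to the parent.

    Illustrative only; this is an assumed shape, not how NRPE is written.
    """
    try:
        pid = os.fork()
    except OSError as exc:
        # errno 12 (ENOMEM) means the kernel could not find memory/swap
        # for the new process.
        raise RuntimeError(
            f"fork() failed with error {exc.errno} ({describe_errno(exc.errno)})"
        ) from exc
    if pid == 0:
        # Child: do the work, then exit without running parent cleanup.
        try:
            handler()
        finally:
            os._exit(0)
    return pid


def reap_children() -> int:
    """Collect exited children without blocking; return how many were reaped."""
    reaped = 0
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped += 1
    return reaped


if __name__ == "__main__":
    print(describe_errno(12))
```

If children are never reaped, or if they hang and never exit, each new connection adds another process until fork() eventually fails, which would be consistent with the repeated failures in the log above.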