Production server wproc errors returned

gregwhite
Posts: 206
Joined: Wed Jun 01, 2011 12:40 pm

Production server wproc errors returned

Post by gregwhite »

Yesterday, our production server started reporting high load averages, climbing as high as 88. I had run an Apply Configuration after deleting a device, and it kept running without ever completing. The Nagios XI web interface was also slow to respond.
The first entry in the event log is:

Runtime Error 2019-06-10 14:57:11 wproc: GLOBAL SERVICE EVENTHANDLER job 19748 from worker Core Worker 15277 is a non-check helper but exited with return code 1
Service Critical 2019-06-10 14:57:11 SERVICE ALERT: KEN-INTERNET-SW1;Check Network latency and packet loss;CRITICAL;HARD;5;CRITICAL - 64.212.90.4: rta nan, lost 100%

The event log showed these other errors:
Runtime Error 2019-06-10 15:23:41 wproc: stdout line 01: UNABLE TO CONNECT TO DB - EXITING!
Runtime Error 2019-06-10 15:23:41 wproc: stderr line 02: using dumb terminal settings.
Runtime Error 2019-06-10 15:23:41 wproc: stderr line 01: No entry for terminal type "unknown";
Runtime Error 2019-06-10 15:23:41 wproc: early_timeout=0; exited_ok=1; wait_status=256; error_code=0;
Runtime Error 2019-06-10 15:23:41 wproc: GLOBAL SERVICE EVENTHANDLER job 22510 from worker Core Worker 15257 is a non-check helper but exited with return code 1
Service Recovery 2019-06-10 15:23:41 SERVICE ALERT: WEY-97LIBBEY-RTR1;Check Network latency and packet loss;OK;SOFT;2;OK - 172.20.254.150: rta 5.736ms, lost 0%
Runtime Error 2019-06-10 15:23:30 wproc: stdout line 01: UNABLE TO CONNECT TO DB - EXITING!
Runtime Error 2019-06-10 15:23:30 wproc: stderr line 02: using dumb terminal settings.
Runtime Error 2019-06-10 15:23:30 wproc: stderr line 01: No entry for terminal type "unknown";
Runtime Error 2019-06-10 15:23:30 wproc: early_timeout=0; exited_ok=1; wait_status=256; error_code=0;
Runtime Error 2019-06-10 15:23:30 wproc: GLOBAL SERVICE EVENTHANDLER job 22491 from worker Core Worker 15253 is a non-check helper but exited with return code 1
Service Recovery 2019-06-10 15:23:29 SERVICE ALERT: CHE-228BILLER-SW31;Check Network latency and packet loss;OK;SOFT;2;OK - 172.22.75.31: rta 6.563ms, lost 0%
Runtime Error 2019-06-10 15:23:20 wproc: stdout line 01: UNABLE TO CONNECT TO DB - EXITING!
Runtime Error 2019-06-10 15:23:20 wproc: stderr line 02: using dumb terminal settings.
Runtime Error 2019-06-10 15:23:20 wproc: stderr line 01: No entry for terminal type "unknown";
Runtime Error 2019-06-10 15:23:20 wproc: early_timeout=0; exited_ok=1; wait_status=256; error_code=0;
Runtime Error 2019-06-10 15:23:20 wproc: GLOBAL SERVICE EVENTHANDLER job 22473 from worker Core Worker 15268 is a non-check helper but exited with return code 1
Service Unknown 2019-06-10 15:23:20 SERVICE ALERT: NOR-1177PROVI-SW31;Interface Table Status - edge network devices;UNKNOWN;SOFT;2;UNKNOWN - Plugin timed out (15s).
Runtime Error 2019-06-10 15:23:16 wproc: stdout line 01: UNABLE TO CONNECT TO DB - EXITING!
Runtime Error 2019-06-10 15:23:16 wproc: stderr line 02: using dumb terminal settings.
Runtime Error 2019-06-10 15:23:16 wproc: stderr line 01: No entry for terminal type "unknown";
Runtime Error 2019-06-10 15:23:16 wproc: early_timeout=0; exited_ok=1; wait_status=256; error_code=0;
Runtime Error 2019-06-10 15:23:16 wproc: GLOBAL SERVICE EVENTHANDLER job 22465 from worker Core Worker 15279 is a non-check helper but exited with return code 1
Service Recovery 2019-06-10 15:23:16 SERVICE ALERT: BTR-111GROSSM-SW26;Check Network latency and packet loss;OK;SOFT;2;OK - 172.22.60.26: rta 3.246ms, lost 0%
Runtime Error 2019-06-10 15:23:14 wproc: stdout line 01: UNABLE TO CONNECT TO DB - EXITING!
Service Notification 2019-06-10 15:16:01 SERVICE NOTIFICATION: pkarr;localhost;Nagios XI - Jobs;CRITICAL;xi_service_notification_handler;Error: Could not parse XML from http://localhost/nagiosxi ()
Service Notification 2019-06-10 15:16:01 SERVICE NOTIFICATION: gwhite;localhost;Nagios XI - Jobs;CRITICAL;xi_service_notification_handler;Error: Could not parse XML from http://localhost/nagiosxi ()
Service Notification 2019-06-10 15:16:01 SERVICE NOTIFICATION: bhankers;localhost;Nagios XI - Jobs;CRITICAL;xi_service_notification_handler;Error: Could not parse XML from http://localhost/nagiosxi ()
Runtime Error 2019-06-10 15:05:56 wproc: host=localhost; service=Nagios XI - Jobs; contact=pkarr
Runtime Error 2019-06-10 15:05:56 wproc: NOTIFY job 20661 from worker Core Worker 15272 is a non-check helper but exited with return code 1
Runtime Error 2019-06-10 15:05:56 wproc: stdout line 01: UNABLE TO CONNECT TO DB - EXITING!
Runtime Error 2019-06-10 15:05:56 wproc: stderr line 02: using dumb terminal settings.
Runtime Error 2019-06-10 15:05:56 wproc: stderr line 01: No entry for terminal type "unknown";
Runtime Error 2019-06-10 15:05:56 wproc: early_timeout=0; exited_ok=1; wait_status=256; error_code=0;
Runtime Error 2019-06-10 15:05:56 wproc: host=localhost; service=Nagios XI - Jobs; contact=bhankers
Runtime Error 2019-06-10 15:05:56 wproc: NOTIFY job 20661 from worker Core Worker 15265 is a non-check helper but exited with return code 1
Service Notification 2019-06-10 15:05:56 SERVICE NOTIFICATION: pkarr;localhost;Nagios XI - Jobs;WARNING;xi_service_notification_handler;Nonstop Operations Manager (nom) stale (5035 seconds old), Nonstop Operations Manager (nom) stale (5035 seconds old), Cleaner (cleaner) stale (474 seconds old)
Service Notification 2019-06-10 15:05:56 SERVICE NOTIFICATION: gwhite;localhost;Nagios XI - Jobs;WARNING;xi_service_notification_handler;Nonstop Operations Manager (nom) stale (5035 seconds old), Nonstop Operations Manager (nom) stale (5035 seconds old), Cleaner (cleaner) stale (474 seconds old)
Service Notification 2019-06-10 15:05:56 SERVICE NOTIFICATION: bhankers;localhost;Nagios XI - Jobs;WARNING;xi_service_notification_handler;Nonstop Operations Manager (nom) stale (5035 seconds old), Nonstop Operations Manager (nom) stale (5035 seconds old), Cleaner (cleaner) stale (474 seconds ol

Then this morning we had several more and it has been quiet since 8:40.

Information 2019-06-11 08:40:48 wproc: Core Worker 1324: job 98998 (pid=14148): Dormant child reaped
Runtime Error 2019-06-11 08:40:48 wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Runtime Error 2019-06-11 08:40:48 wproc: host=DRBOTTOMsw2; service=Check fan status on a cisco router or switch;
Runtime Error 2019-06-11 08:40:48 wproc: CHECK job 98998 from worker Core Worker 1324 timed out after 60.01s
Information 2019-06-11 08:40:48 wproc: Core Worker 1324: job 98998 (pid=14148) timed out. Killing it
Runtime Error 2019-06-11 03:15:58 wproc: stdout line 01: OK - No valid historical dataset... [details]
Runtime Error 2019-06-11 03:15:58 wproc: early_timeout=0; exited_ok=1; wait_status=14; error_code=0;
Runtime Error 2019-06-11 03:15:58 wproc: host=NOR-1177PROVI-SW41; service=Interface Table Status - edge network devices;
Runtime Error 2019-06-11 03:15:58 wproc: CHECK job 64635 from worker Core Worker 1318 died by signal 14 after 15.17 seconds
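
For anyone reading the wait_status values in the entries above: they look like raw wait statuses in the usual Linux encoding (low 7 bits are the terminating signal, the next 8 bits are the exit code). Below is a minimal sketch of that decoding, assuming that encoding applies here; decode_wait_status is a hypothetical helper name, not something shipped with Nagios.

```shell
#!/bin/sh
# Decode a raw wait status such as the wproc "wait_status=" field.
# Assumption: standard Linux encoding (signal in the low 7 bits,
# exit code in the next 8 bits). Stopped/continued states are ignored.
# The function body runs in a subshell so its variables stay local.
decode_wait_status() (
    status=$1
    sig=$(( status & 127 ))
    code=$(( (status >> 8) & 255 ))
    if [ "$sig" -ne 0 ]; then
        echo "killed by signal $sig"
    else
        echo "exited with code $code"
    fi
)

decode_wait_status 256   # value logged for the EVENTHANDLER and NOTIFY jobs
decode_wait_status 14    # value logged for the CHECK job on NOR-1177PROVI-SW41
```

Under that assumption, 256 decodes to exit code 1 and 14 decodes to signal 14, which agrees with the "exited with return code 1" and "died by signal 14" lines in the log.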
pkarr
Posts: 58
Joined: Fri Oct 05, 2012 1:01 pm

Re: Production server wproc errors returned

Post by pkarr »

Some additional information on yesterday's issues on LKENSHERLOCKP01
We received an email notification regarding critical load averages and Nagios XI Jobs - stale (I've attached it below)

Greg was running an Apply Configuration that had not finished after 20 minutes, so we rebooted the server.
After it came back up, it was still showing critical errors for dbmaint and nom.
I checked the log files for the associated cron jobs in /usr/local/nagiosxi/var
and ran the updates for nom (nom.php) and dbmaint (dbmaint.php) manually.

And ran a repair on the database (repair_databases.sh), which came back error free.
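
A rough way to see how recently each of those cron jobs last wrote to its log is sketched below. It is only a proxy: the "stale (N seconds old)" figure in the notification comes from XI's own bookkeeping, not from file timestamps. file_age_seconds is a hypothetical helper, and the dbmaint.log and cleaner.log names are assumptions based on the job names.

```shell
#!/bin/sh
# Seconds since a file was last modified.
# Assumes GNU stat (-c %Y), as on typical RHEL/CentOS installs.
# The function body runs in a subshell so its variables stay local.
file_age_seconds() (
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || exit 1
    echo $(( now - mtime ))
)

# Log file names are assumed to follow the job names in the notification.
for job in nom dbmaint cleaner; do
    log="/usr/local/nagiosxi/var/$job.log"
    if [ -e "$log" ]; then
        echo "$job: log last written $(file_age_seconds "$log") seconds ago"
    else
        echo "$job: $log not found"
    fi
done
```

A log that has not been written for much longer than the job's cron interval would suggest the job is hanging or not being started at all.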

I have included the error message we initially received, a current system profile and the messages log file.

thanks,
Penny and Greg
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Production server wproc errors returned

Post by cdienger »

I'd be curious to see the Apache logs (/var/log/httpd/) from yesterday so we can line up anything in there with the messages in the event log.

The mysqld.log from the database server may also have some interesting clues.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
pkarr
Posts: 58
Joined: Fri Oct 05, 2012 1:01 pm

Re: Production server wproc errors returned

Post by pkarr »

Sure. Here are the httpd log files. The MySQL database is remote; when I tried to access the log, /var/log/mysqld.log was empty.

thanks,
Penny
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Production server wproc errors returned

Post by cdienger »

You can get the IP that the database is hosted at with:

Code:

cat /usr/local/nagiosxi/html/config.inc.php | sed -rn "/\"ndoutils\" => array\(*/,/\"dbmaint\"/p" | grep -o -P '(?<="dbserver" => ).*(?=,)' | tr -d \'
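
Since the event log shows "UNABLE TO CONNECT TO DB", it may also be worth checking from the XI server whether that address accepts TCP connections at all. The following is a minimal sketch, not an official tool: check_db_port is a hypothetical helper, and port 3306 is only the MySQL default, since the thread does not say which port is in use.

```shell
#!/usr/bin/env bash
# Report whether a TCP connection to host:port can be opened.
# Assumptions: bash (for /dev/tcp) and coreutils timeout are available;
# the port defaults to 3306, the MySQL default.
check_db_port() {
    local host=$1
    local port=${2:-3306}
    # Open the socket in a child bash so the descriptor closes when it exits;
    # timeout bounds the attempt to 5 seconds.
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "reachable: $host:$port"
    else
        echo "unreachable: $host:$port"
        return 1
    fi
}
```

Calling check_db_port with the address printed by the command above tests only the network path. A "reachable" result does not rule out MySQL-side causes such as connection limits or authentication failures.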
gregwhite
Posts: 206
Joined: Wed Jun 01, 2011 12:40 pm

Re: Production server wproc errors returned

Post by gregwhite »

Penny is working on getting that information. I wanted to let you know that the NOM errors came back and the load averages have gone critical.
pkarr
Posts: 58
Joined: Fri Oct 05, 2012 1:01 pm

Re: Production server wproc errors returned

Post by pkarr »

Ok but I already know the ip address of our mysql db server, LKENFUSIONP01 - 172.22.3.125
and that is what running your command came back with:

[root@lkensherlockp01 ~]# cat /usr/local/nagiosxi/html/config.inc.php | sed -rn "/\"ndoutils\" => array\(*/,/\"dbmaint\"/p" | grep -o -P '(?<="dbserver" => ).*(?=,)' | tr -d \'
172.22.3.125
[root@lkensherlockp01 ~]#

Actually, NOM has just started to act up again. I ran nom.php manually and that cleared it.
Let me look at the mysqld.log file on LKENFUSIONP01 and see if there is anything in it.
Sigh, there wasn't.

-Penny
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Production server wproc errors returned

Post by cdienger »

Let's get another profile taken while the problem is occurring.
pkarr
Posts: 58
Joined: Fri Oct 05, 2012 1:01 pm

Re: Production server wproc errors returned

Post by pkarr »

Ok here it is.

-Penny
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Production server wproc errors returned

Post by cdienger »

I'm not really seeing anything new in this one. Can you clarify what you mean by "NOM acting up"? Please provide us with a copy of /usr/local/nagiosxi/var/nom.log.