Nagios XI - Crashed

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
operations_asavie
Posts: 33
Joined: Tue Dec 22, 2015 7:07 am

Nagios XI - Crashed

Post by operations_asavie »

Hi,

Could you help me figure out why the Nagios services stopped for about 2 hours in the early hours of this morning?

In nagios.log I could see the following:

Code: Select all

[1473123760] SERVICE ALERT: JW8F5Z1.mgmt;RT-OWN0-01 Networking;CRITICAL;SOFT;1;ESX3 CRITICAL - HOST-VM NET Unknown error
[1473123767] wproc: Core Worker 19304: job 346 (pid=30380) timed out. Killing it
[1473123767] wproc: GLOBAL SERVICE EVENTHANDLER job 346 from worker Core Worker 19304 timed out after 30.01s
[1473123767] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1473123767] wproc:   stderr line 01: PHP Deprecated:  Comments starting with '#' are deprecated in /etc/php.ini on line 946 in Unknown on line 0
[1473123767] Warning: Global service event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host="GWDSS22.mgmt" --service="SB00B Memory" --hostaddress="172.17.4.9" --hoststate=UP --hoststateid=0 --hosteventid=0 --hostproblemid=0 --servicestate=OK --servicestateid=0 --lastservicestate=CRITICAL --lastservicestateid=2 --servicestatetype=SOFT --currentattempt=2 --maxattempts=5 --serviceeventid=191407 --serviceproblemid=0 --serviceoutput="ESX3 OK - SB00B mem usage=2.99 %" --longserviceoutput="" --servicedowntime=0' timed out after 0.00 seconds
[1473123767] wproc: Core Worker 19304: job 346 (pid=30380): Dormant child reaped
[1473123790] wproc: Core Worker 19303: job 403 (pid=31869) timed out. Killing it
[1473123790] wproc: GLOBAL SERVICE EVENTHANDLER job 403 from worker Core Worker 19303 timed out after 30.01s
[1473123790] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1473
[1473132376] Warning: A system time change of 1712 seconds (0d 0h 28m 32s forwards in time) has been detected.  Compensating...
[1473132459] SERVICE ALERT: localhost;Current Load;OK;HARD;4;OK - load average: 2.58, 1.46, 1.16

I didn't see any issues or entries in mysqld.log that could account for this either.
Last edited by tmcdonald on Tue Sep 06, 2016 9:33 am, edited 1 time in total.
Reason: Please use [code][/code] tags around log output
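
As a rough cross-check of the outage window, the epoch timestamps in the nagios.log excerpt above can be converted to readable times. The sketch below is only an illustration: the two timestamps are copied from the excerpt, the helper names (`to_utc`, `format_duration`) are made up for this example, and it assumes the last `wproc` line before the truncated entry is the last thing logged before the gap.

```python
from datetime import datetime, timezone

# Epoch timestamps copied from the nagios.log excerpt in the post above.
LAST_BEFORE_GAP = 1473123790   # last wproc entry before the log goes quiet (assumption)
FIRST_AFTER_GAP = 1473132376   # the "system time change" warning
REPORTED_TIME_JUMP = 1712      # seconds, as reported by Nagios Core itself


def to_utc(epoch: int) -> str:
    """Render a nagios.log epoch timestamp as a UTC string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S UTC"
    )


def format_duration(seconds: int) -> str:
    """Format a number of seconds as 'Xh Ym Zs'."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


# Length of the silent period between the two log entries.
gap_seconds = FIRST_AFTER_GAP - LAST_BEFORE_GAP

if __name__ == "__main__":
    print("last entry before gap:", to_utc(LAST_BEFORE_GAP))
    print("first entry after gap:", to_utc(FIRST_AFTER_GAP))
    print("silent period:        ", format_duration(gap_seconds))
    print("reported time jump:   ", format_duration(REPORTED_TIME_JUMP))
```

Under that assumption the silent period works out to roughly 2h 23m, which lines up with the "2 hrs" described in the post. The 1712-second time change that Nagios reports covers only part of that window.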
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Nagios XI - Crashed

Post by rkennedy »

Could you PM over your profile for us to review? (Admin -> System Profile -> Download Profile)
Former Nagios Employee
operations_asavie
Posts: 33
Joined: Tue Dec 22, 2015 7:07 am

Re: Nagios XI - Crashed

Post by operations_asavie »

PM sent with the requested profile.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Nagios XI - Crashed

Post by rkennedy »

operations@asavie wrote: PM sent with the requested profile.
Could you please double check? I'm not seeing anything in my inbox.

EDIT: Profile Received.
Former Nagios Employee
operations_asavie
Posts: 33
Joined: Tue Dec 22, 2015 7:07 am

Re: Nagios XI - Crashed

Post by operations_asavie »

Not sure what the craic is here, but the mail is stuck in my outbox. I tried a number of times to resend, but each time it doesn't go further than my outbox.

Any other ideas of how to get it over to you?
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Nagios XI - Crashed

Post by rkennedy »

Looking at the logs, it appears the load spiked through the night on the 5th:

Code: Select all

[09-05-2016 20:35:44] NPCD: WARN: MAX load reached: load 10.960000/10.000000 at i=0
[09-05-2016 20:36:17] NPCD: WARN: MAX load reached: load 10.070000/10.000000 at i=0
[09-05-2016 20:45:40] NPCD: ERROR: Executed command exits with return code '7'
[09-05-2016 20:45:40] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1473108321.perfdata.service'
[09-05-2016 20:45:55] NPCD: WARN: MAX load reached: load 11.210000/10.000000 at i=0
[09-05-2016 20:46:10] NPCD: WARN: MAX load reached: load 12.420000/10.000000 at i=1
[09-05-2016 20:46:25] NPCD: WARN: MAX load reached: load 10.020000/10.000000 at i=1

Do you have scheduled reports, SNMP checks that failed, or anything else that runs at this time? I've seen cases in the past where a switch failing in the middle of the night causes a high load on XI, depending on how many interfaces it has. Other than that, I'm not seeing anything stand out. It looks like you have ample resources to handle the number of checks you have.
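
To line the load spike up against scheduled jobs, it can help to pull just the "MAX load reached" entries out of the NPCD log and see when the load peaked. This is a minimal sketch, not an official tool: the regular expression is written against the line format shown in the excerpt above, the sample lines are copied from that excerpt, and `parse_load_warnings` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Matches lines such as:
# [09-05-2016 20:35:44] NPCD: WARN: MAX load reached: load 10.960000/10.000000 at i=0
NPCD_LOAD_RE = re.compile(
    r"^\[(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] "
    r"NPCD: WARN: MAX load reached: load (?P<load>[\d.]+)/(?P<limit>[\d.]+)"
)


def parse_load_warnings(lines):
    """Yield (timestamp, load, limit) for each 'MAX load reached' line."""
    for line in lines:
        match = NPCD_LOAD_RE.match(line.strip())
        if match:
            # Timestamps in the excerpt are month-day-year.
            ts = datetime.strptime(match.group("ts"), "%m-%d-%Y %H:%M:%S")
            yield ts, float(match.group("load")), float(match.group("limit"))


# Sample lines copied from the excerpt above (the ERROR line is ignored).
SAMPLE = """\
[09-05-2016 20:35:44] NPCD: WARN: MAX load reached: load 10.960000/10.000000 at i=0
[09-05-2016 20:36:17] NPCD: WARN: MAX load reached: load 10.070000/10.000000 at i=0
[09-05-2016 20:45:40] NPCD: ERROR: Executed command exits with return code '7'
[09-05-2016 20:45:55] NPCD: WARN: MAX load reached: load 11.210000/10.000000 at i=0
[09-05-2016 20:46:10] NPCD: WARN: MAX load reached: load 12.420000/10.000000 at i=1
[09-05-2016 20:46:25] NPCD: WARN: MAX load reached: load 10.020000/10.000000 at i=1
"""

warnings = list(parse_load_warnings(SAMPLE.splitlines()))
peak = max(warnings, key=lambda entry: entry[1])

if __name__ == "__main__":
    for ts, load, limit in warnings:
        print(f"{ts:%Y-%m-%d %H:%M:%S}  load {load:.2f} (limit {limit:.2f})")
    print("peak:", peak[1], "at", peak[0])
```

Fed the full log instead of the sample, the same loop would give a list of times to compare against cron jobs, scheduled reports, or check schedules.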
Former Nagios Employee
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios XI - Crashed

Post by tmcdonald »

operations@asavie wrote: Not sure what the craic is here, but the mail is stuck in my outbox. I tried a number of times to resend, but each time it doesn't go further than my outbox.

Any other ideas of how to get it over to you?
For reference, in phpBB (the forum software), if something is in the Outbox it just means the recipient has not yet viewed it. Once they have, it is moved to "Sent messages".
Former Nagios employee