Nagios XI - Crashed

Posted: Tue Sep 06, 2016 4:51 am
by operations_asavie
Hi,

Could you help me figure out why the Nagios services stopped for about 2 hours in the early hours of this morning?

In nagios.log I could see the following:

Code:

[1473123760] SERVICE ALERT: JW8F5Z1.mgmt;RT-OWN0-01 Networking;CRITICAL;SOFT;1;ESX3 CRITICAL - HOST-VM NET Unknown error
[1473123767] wproc: Core Worker 19304: job 346 (pid=30380) timed out. Killing it
[1473123767] wproc: GLOBAL SERVICE EVENTHANDLER job 346 from worker Core Worker 19304 timed out after 30.01s
[1473123767] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1473123767] wproc:   stderr line 01: PHP Deprecated:  Comments starting with '#' are deprecated in /etc/php.ini on line 946 in Unknown on line 0
[1473123767] Warning: Global service event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host="GWDSS22.mgmt" --service="SB00B Memory" --hostaddress="172.17.4.9" --hoststate=UP --hoststateid=0 --hosteventid=0 --hostproblemid=0 --servicestate=OK --servicestateid=0 --lastservicestate=CRITICAL --lastservicestateid=2 --servicestatetype=SOFT --currentattempt=2 --maxattempts=5 --serviceeventid=191407 --serviceproblemid=0 --serviceoutput="ESX3 OK - SB00B mem usage=2.99 %" --longserviceoutput="" --servicedowntime=0' timed out after 0.00 seconds
[1473123767] wproc: Core Worker 19304: job 346 (pid=30380): Dormant child reaped
[1473123790] wproc: Core Worker 19303: job 403 (pid=31869) timed out. Killing it
[1473123790] wproc: GLOBAL SERVICE EVENTHANDLER job 403 from worker Core Worker 19303 timed out after 30.01s
[1473123790] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1473
[1473132376] Warning: A system time change of 1712 seconds (0d 0h 28m 32s forwards in time) has been detected.  Compensating...
[1473132459] SERVICE ALERT: localhost;Current Load;OK;HARD;4;OK - load average: 2.58, 1.46, 1.16

I didn't see any issues or entries in mysqld.log that could account for this either.
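
One way to pin down the outage window is to look for long silences between consecutive epoch timestamps in nagios.log. The following is a minimal sketch, not an official Nagios tool: the names `find_gaps` and `fmt` and the 600-second threshold are made up for illustration, and the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime, timezone

# nagios.log entries start with a 10-digit epoch timestamp in square brackets.
# search() is used instead of match() so that a truncated fragment in front of
# a timestamp (such as "[1473") does not hide the entry that follows it.
TS_RE = re.compile(r"\[(\d{10})\]")

def find_gaps(lines, min_gap=600):
    """Return (prev_ts, next_ts, gap_seconds) for each silence longer than min_gap.

    min_gap is a hypothetical threshold chosen for this example.
    """
    gaps = []
    prev = None
    for line in lines:
        m = TS_RE.search(line)
        if not m:
            continue  # no complete timestamp on this line, skip it
        ts = int(m.group(1))
        if prev is not None and ts - prev > min_gap:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps

def fmt(ts):
    """Format an epoch timestamp as UTC (the server's local time may differ)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

sample = [
    "[1473123790] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;",
    "[1473",
    "[1473132376] Warning: A system time change of 1712 seconds (0d 0h 28m 32s forwards in time) has been detected.  Compensating...",
    "[1473132459] SERVICE ALERT: localhost;Current Load;OK;HARD;4;OK - load average: 2.58, 1.46, 1.16",
]

for start, end, gap in find_gaps(sample):
    print(f"{fmt(start)} -> {fmt(end)}: silent for {gap} s")
```

On the excerpt above this reports a single gap of 8586 seconds (about 2 h 23 min) between the last event handler timeout and the system time change warning, which lines up with the roughly 2-hour stop.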

Re: Nagios XI - Crashed

Posted: Tue Sep 06, 2016 9:37 am
by rkennedy
Could you PM over your profile for us to review? (Admin -> System Profile -> Download Profile)

Re: Nagios XI - Crashed

Posted: Wed Sep 07, 2016 3:00 am
by operations_asavie
PM sent with the requested profile.

Re: Nagios XI - Crashed

Posted: Wed Sep 07, 2016 9:42 am
by rkennedy
operations@asavie wrote: PM sent with the requested profile.
Could you please double check? I'm not seeing anything in my inbox.

EDIT: Profile Received.

Re: Nagios XI - Crashed

Posted: Thu Sep 08, 2016 9:03 am
by operations_asavie
Not sure what the craic is here, but the mail is stuck in my outbox. I've tried resending a number of times, but each time it doesn't get any further than my outbox.

Any other ideas of how to get it over to you?

Re: Nagios XI - Crashed

Posted: Thu Sep 08, 2016 10:04 am
by rkennedy
Looking at the logs, it appears the load spiked through the night on the 5th:

Code:

[09-05-2016 20:35:44] NPCD: WARN: MAX load reached: load 10.960000/10.000000 at i=0
[09-05-2016 20:36:17] NPCD: WARN: MAX load reached: load 10.070000/10.000000 at i=0
[09-05-2016 20:45:40] NPCD: ERROR: Executed command exits with return code '7'
[09-05-2016 20:45:40] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1473108321.perfdata.service'
[09-05-2016 20:45:55] NPCD: WARN: MAX load reached: load 11.210000/10.000000 at i=0
[09-05-2016 20:46:10] NPCD: WARN: MAX load reached: load 12.420000/10.000000 at i=1
[09-05-2016 20:46:25] NPCD: WARN: MAX load reached: load 10.020000/10.000000 at i=1

Do you have scheduled reports, SNMP checks that failed, or anything else that runs at this time? I've seen cases in the past where a switch failing in the middle of the night causes a high load on XI, depending on how many interfaces it has. Other than that, I'm not seeing anything stand out. It looks like you have ample resources to handle the number of checks you have.
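
To see when the load peaked, the NPCD "MAX load reached" warnings can be pulled out of the log and compared. This is only a rough sketch under a few assumptions: `load_warnings` is a made-up helper name, the regular expression is based solely on the line format shown in the excerpt above, and the sample lines are copied from that excerpt.

```python
import re

# Matches lines like:
# [09-05-2016 20:35:44] NPCD: WARN: MAX load reached: load 10.960000/10.000000 at i=0
LOAD_RE = re.compile(
    r"^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] NPCD: WARN: MAX load reached: "
    r"load (\d+\.\d+)/(\d+\.\d+)"
)

def load_warnings(lines):
    """Return (timestamp, load, limit) for every NPCD max-load warning."""
    found = []
    for line in lines:
        m = LOAD_RE.match(line)
        if m:
            found.append((m.group(1), float(m.group(2)), float(m.group(3))))
    return found

sample = [
    "[09-05-2016 20:35:44] NPCD: WARN: MAX load reached: load 10.960000/10.000000 at i=0",
    "[09-05-2016 20:36:17] NPCD: WARN: MAX load reached: load 10.070000/10.000000 at i=0",
    "[09-05-2016 20:45:40] NPCD: ERROR: Executed command exits with return code '7'",
    "[09-05-2016 20:45:55] NPCD: WARN: MAX load reached: load 11.210000/10.000000 at i=0",
    "[09-05-2016 20:46:10] NPCD: WARN: MAX load reached: load 12.420000/10.000000 at i=1",
    "[09-05-2016 20:46:25] NPCD: WARN: MAX load reached: load 10.020000/10.000000 at i=1",
]

warnings = load_warnings(sample)
peak = max(warnings, key=lambda w: w[1])  # entry with the highest load value
print(f"{len(warnings)} load warnings, peak {peak[1]:.2f} (limit {peak[2]:.2f}) at {peak[0]}")
```

Run against the full log rather than this sample, the same approach would show whether the spikes cluster around a particular time, such as a scheduled report or a batch of failing checks.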

Re: Nagios XI - Crashed

Posted: Thu Sep 08, 2016 10:30 am
by tmcdonald
operations@asavie wrote: Not sure what the craic is here, but the mail is stuck in my outbox. I've tried resending a number of times, but each time it doesn't get any further than my outbox.

Any other ideas of how to get it over to you?

For reference, in phpBB (the forum software), a message sitting in the Outbox just means the recipient has not yet viewed it. Once they have, it is moved to "Sent messages".