Monitoring Engine Process failing to start

Posted: Tue Jul 29, 2014 11:30 am
by BanditBBS
The left three are green checks, the right three are blue exclamation marks. Under Monitoring Engine Process it shows stopped, and I cannot get it to start.

EDIT: I don't know what happened, but it eventually restarted. It was not working for 30 minutes or so... weird.

Re: Monitoring Engine Process failing to start

Posted: Tue Jul 29, 2014 1:05 pm
by BanditBBS
I just needed to apply changes again and it's doing it again. We've only added a few host and service groups and made a few minor changes to services. This started happening all of a sudden today, and I am seeing no errors.

I even rebooted the server as a last resort, and it is taking forever to start the monitoring engine.

Yeah, it took 15 minutes to show all 6 as green.

Re: Monitoring Engine Process failing to start

Posted: Tue Jul 29, 2014 2:56 pm
by sreinhardt
Well, those check marks are updated via cron, but that should run at least every 30 seconds. While they were showing as blue, was the nagios process running and appearing to run checks? I assume you tried it, but did write/verify fail or show any errors either?

Re: Monitoring Engine Process failing to start

Posted: Tue Jul 29, 2014 3:13 pm
by BanditBBS
sreinhardt wrote: Well, those check marks are updated via cron, but that should run at least every 30 seconds. While they were showing as blue, was the nagios process running and appearing to run checks? I assume you tried it, but did write/verify fail or show any errors either?
The nagios process was up, but if you look at the Monitoring Engine Event Queue dashlet, everything just stacks up and the hosts are all greyed out.

Write verify shows zero errors. Everything functions fine once it starts after 15 minutes.

Re: Monitoring Engine Process failing to start

Posted: Tue Jul 29, 2014 3:30 pm
by sreinhardt
How large an environment are we talking about, in terms of both host/service counts and system resources? 15 minutes seems like quite a delay for something that should be pretty much instantaneous in most cases.

Re: Monitoring Engine Process failing to start

Posted: Tue Jul 29, 2014 3:39 pm
by BanditBBS
sreinhardt wrote: How large an environment are we talking about, in terms of both host/service counts and system resources? 15 minutes seems like quite a delay for something that should be pretty much instantaneous in most cases.
This was working fine last night...all of a sudden started doing this today.

131 hosts and 1436 services, so rather small. The server has 32 GB of RAM and 8 cores.

Re: Monitoring Engine Process failing to start

Posted: Tue Jul 29, 2014 4:02 pm
by sreinhardt
Small... more like complete and total overkill with those hardware specs. OK, there goes that idea. Would you be willing to tail the nagios log during a restart and send it over?

Code:

tail -f /usr/local/nagios/var/nagios.log 2>&1 | tee -a /tmp/nagios.log &
service nagios restart
killall tail 
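
If the restart is slow, running killall right after the restart command returns may cut the capture short, and it also stops any other tail on the box. A rough sketch of an alternative, assuming the same log path as above; the function name capture_log and the wait time are made up for illustration:

```shell
# capture_log LOG OUT WAIT_SECS COMMAND [ARGS...]
# Follow LOG into OUT while COMMAND runs, then keep following for WAIT_SECS.
capture_log() {
    log="$1"; out="$2"; wait_secs="$3"
    shift 3
    # -n 0: only lines written from now on; -F: keep following across rotation
    tail -n 0 -F "$log" >> "$out" 2>&1 &
    tail_pid=$!
    sleep 1                 # give tail a moment to open the file
    "$@"                    # the command to watch, e.g. a service restart
    sleep "$wait_secs"      # keep capturing through the slow startup
    kill "$tail_pid"        # stop only this tail, not every tail on the box
    wait "$tail_pid" 2>/dev/null || true
}
```

Something like `capture_log /usr/local/nagios/var/nagios.log /tmp/nagios.log 900 service nagios restart` would then keep recording for 15 minutes after the restart command returns.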

Re: Monitoring Engine Process failing to start

Posted: Tue Jul 29, 2014 4:28 pm
by BanditBBS
Here you go:

Code:

[1406668437] Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
[1406668439] Nagios 4.0.7 starting... (PID=29516)
[1406668439] Local time is Tue Jul 29 16:13:59 CDT 2014
[1406668439] LOG VERSION: 2.0
[1406668439] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1406668439] qh: core query handler registered
[1406668439] nerd: Channel hostchecks registered successfully
[1406668439] nerd: Channel servicechecks registered successfully
[1406668439] nerd: Channel opathchecks registered successfully
[1406668439] nerd: Fully initialized and ready to rock!
[1406668439] wproc: Successfully registered manager as @wproc with query handler
[1406668439] wproc: Registry request: name=Core Worker 29519;pid=29519
[1406668439] wproc: Registry request: name=Core Worker 29518;pid=29518
[1406668439] wproc: Registry request: name=Core Worker 29520;pid=29520
[1406668439] wproc: Registry request: name=Core Worker 29525;pid=29525
[1406668439] wproc: Registry request: name=Core Worker 29521;pid=29521
[1406668439] wproc: Registry request: name=Core Worker 29524;pid=29524
[1406668439] wproc: Registry request: name=Core Worker 29522;pid=29522
[1406668439] wproc: Registry request: name=Core Worker 29527;pid=29527
[1406668439] wproc: Registry request: name=Core Worker 29528;pid=29528
[1406668439] wproc: Registry request: name=Core Worker 29526;pid=29526
[1406668439] wproc: Registry request: name=Core Worker 29523;pid=29523
[1406668439] wproc: Registry request: name=Core Worker 29529;pid=29529
[1406668439] ndomod: NDOMOD 2.0.0 (02-28-2014) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[1406668439] ndomod: Successfully connected to data sink.  0 queued items to flush.
[1406668439] ndomod registered for process data
[1406668439] ndomod registered for log data'
[1406668439] ndomod registered for system command data'
[1406668439] ndomod registered for event handler data'
[1406668439] ndomod registered for notification data'
[1406668439] ndomod registered for comment data'
[1406668439] ndomod registered for downtime data'
[1406668439] ndomod registered for flapping data'
[1406668439] ndomod registered for program status data'
[1406668439] ndomod registered for host status data'
[1406668439] ndomod registered for service status data'
[1406668439] ndomod registered for adaptive program data'
[1406668439] ndomod registered for adaptive host data'
[1406668439] ndomod registered for adaptive service data'
[1406668439] ndomod registered for external command data'
[1406668439] ndomod registered for aggregated status data'
[1406668439] ndomod registered for retention data'
[1406668439] ndomod registered for contact data'
[1406668439] ndomod registered for contact notification data'
[1406668439] ndomod registered for acknowledgement data'
[1406668439] ndomod registered for state change data'
[1406668439] ndomod registered for contact status data'
[1406668439] ndomod registered for adaptive contact data'
[1406668439] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[1406668439] Warning: Host 'CDM Checks - Linux' has no default contacts or contactgroups defined!
[1406668439] Successfully launched command file worker with pid 29535

Once everything was green (15 minutes), service check stuff started showing up.

EDIT: All of a sudden it now starts faster than the page can reload... I just need a drink!

EDIT2: And hours later it started taking 10+ minutes to restart again. I'm beginning to think this is a load thing, but I'm not sure about anything at this point.
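
Every line in the excerpt above is stamped within the same two seconds, so the delay presumably falls after the last line shown. One way to confirm where the time goes in a longer capture is to look for large jumps between consecutive timestamps. A minimal sketch, assuming the usual "[epoch] message" log format; log_gaps is a made-up helper name:

```shell
# log_gaps FILE [MIN_GAP]
# Print every gap of at least MIN_GAP seconds (default 60) between
# consecutive timestamped lines in a Nagios log.
log_gaps() {
    awk -v min="${2:-60}" '
        # Only consider lines that start with "[<epoch seconds>]"
        match($0, /^\[[0-9]+\]/) {
            ts = substr($0, 2, RLENGTH - 2) + 0
            if (prev != "" && ts - prev >= min)
                printf "%d second gap before: %s\n", ts - prev, $0
            prev = ts
        }
    ' "$1"
}
```

Running `log_gaps /tmp/nagios.log 60` on the captured file should print the first entry written after each pause of a minute or more.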

Re: Monitoring Engine Process failing to start

Posted: Wed Jul 30, 2014 11:23 am
by tmcdonald
You probably won't have any meaningful data during the long restarts, but do your localhost CPU load graphs show any patterns before the failures?
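
Since the graphs will likely have a hole during a slow restart, the load can be sampled directly while the restart is in progress. A rough sketch, assuming a Linux host with /proc/loadavg; sample_load and the output path in the usage line are made up for illustration:

```shell
# sample_load [COUNT] [INTERVAL]
# Print COUNT samples (default 60), INTERVAL seconds apart (default 5).
# Each line: <epoch seconds> <1-min load> <5-min load> <15-min load>
sample_load() {
    count="${1:-60}"
    interval="${2:-5}"
    i=0
    while [ "$i" -lt "$count" ]; do
        # /proc/loadavg begins with the 1, 5 and 15 minute load averages
        read -r one five fifteen rest < /proc/loadavg
        echo "$(date +%s) $one $five $fifteen"
        i=$((i + 1))
        # no need to sleep after the final sample
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `sample_load 180 5 > /tmp/load.log &` started just before the restart would cover 15 minutes, and the epoch timestamps can be lined up against those in nagios.log.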

Re: Monitoring Engine Process failing to start

Posted: Wed Jul 30, 2014 11:31 am
by BanditBBS
Sort of...
[Attachment: chart.jpeg]
Also, load was never over 2.5, so I should have plenty of horsepower.