Debugging Nagios Failed Reload
-
delboy1966
- Posts: 98
- Joined: Thu Oct 22, 2015 5:26 am
Debugging Nagios Failed Reload
This has been bugging me for a few weeks now, but I haven't had time to investigate it until now.
After making changes to config files and reloading Nagios using the init script, Nagios will sometimes stop running. It also happens if I do a restart rather than a reload.
We can make a lot of changes during the day, so I may have to restart Nagios 8 or 9 times a day. Out of those 8 or 9 times, I hit the issue about 5 times.
I do a reload, then a ps listing, and find it's not running anymore. I then have to issue a start at least 3 times before it's running again.
I've done what I can to try and debug it myself but can't find anything.
I have enabled all logging and debugging but see no errors in the logs.
I have disabled the 2 broker_modules I'm running, gearmand and livestatus, and it still happens.
I need to find a way of tracing what the nagios reload is doing and where it fails, if anyone has any suggestions.
Running
Nagios Core 4.1.1 built from source
CentOS release 6.7 (Final) x86_64
Gearmand
Livestatus
Thanks in advance.
Tony
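One way to catch the failure as it happens, as a rough sketch only: check the PID recorded in the Nagios lock file right after each reload. The lock file path below is an assumption (it is whatever lock_file is set to in nagios.cfg), the 5-second wait is arbitrary, and both function names are made up for illustration. It needs to run as root or as the nagios user, since kill -0 fails on processes owned by someone else.

```shell
#!/bin/sh
# Return 0 if the PID stored in the given lock file belongs to a live process.
pid_alive() {
    lockfile=$1
    [ -r "$lockfile" ] || return 1
    pid=$(head -n 1 "$lockfile" | tr -d '[:space:]')
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;    # empty or not a number
    esac
    kill -0 "$pid" 2>/dev/null      # signal 0 only tests for existence
}

# Reload Nagios, wait briefly, then report whether it survived.
reload_and_check() {
    lockfile=${1:-/usr/local/nagios/var/nagios.lock}   # assumed path
    /etc/rc.d/init.d/nagios reload
    sleep 5
    if pid_alive "$lockfile"; then
        echo "$(date '+%F %T') nagios still running"
    else
        echo "$(date '+%F %T') nagios NOT running after reload"
        return 1
    fi
}
```

Calling reload_and_check instead of the init script directly would give a timestamped record of which reloads left Nagios dead, which can then be lined up against the logs.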
-
delboy1966
- Posts: 98
- Joined: Thu Oct 22, 2015 5:26 am
Re: Debugging Nagios Failed Reload
Just to add...
I've already tried running nagios with -uxd to skip the circular path checks and use pre-cached configs.
Tony
Re: Debugging Nagios Failed Reload
Are you manually updating your config, or using something like nagiosql or nconf to manage it?
Looking forward to seeing you all at #NagiosCon2019?
-Dedicated Lover of Nconf,PNP4Nagios and Nagvis
-
delboy1966
- Posts: 98
- Joined: Thu Oct 22, 2015 5:26 am
Re: Debugging Nagios Failed Reload
I am manually updating the configs.
Re: Debugging Nagios Failed Reload
How many hosts / services are you monitoring? What kind of resources do you have allocated to this machine? It sounds like you're hitting a throttle somewhere.
Can you post your nagios.log, and a tail of the syslog when this happens again for us to look at?
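A small sketch of collecting both pieces in one go, in case it helps. The default paths are assumptions for a source-built Nagios on CentOS, and the function name is hypothetical; pass your own paths if they differ.

```shell
#!/bin/sh
# Print the last lines of nagios.log, plus any syslog lines that mention
# nagios together with a segfault or the OOM killer.
collect_reload_evidence() {
    nagios_log=${1:-/usr/local/nagios/var/nagios.log}   # assumed path
    sys_log=${2:-/var/log/messages}                     # assumed path
    lines=${3:-50}

    echo "== last $lines lines of $nagios_log =="
    tail -n "$lines" "$nagios_log"

    echo "== crash-related lines in $sys_log =="
    grep -i 'nagios' "$sys_log" \
        | grep -i -E 'segfault|out of memory|killed process' \
        || echo "(none found)"
}
```

Running collect_reload_evidence straight after a failed reload keeps the two excerpts together, so the timestamps can be compared directly.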
Former Nagios Employee
-
delboy1966
- Posts: 98
- Joined: Thu Oct 22, 2015 5:26 am
Re: Debugging Nagios Failed Reload
I am monitoring:
567 Hosts
3781 Services
Which is pretty much in line with the numbers I've monitored previously at other companies.
The server has plenty of resources as it's not doing any checks itself; mod_gearman is installed and I have 6 worker nodes with mod_gearman_worker running on them.
When I did the reload this morning the Nagios process again stopped running. Here is the tail of the nagios.log file, which is exactly the same as the tail of /var/log/messages:
Code: Select all
[1458546546] Caught SIGHUP, restarting...
[1458546546] Event broker module 'NERD' deinitialized successfully.
[1458546546] Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' deinitialized successfully.
[1458546547] livestatus: Socket thread has terminated
[1458546547] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' deinitialized successfully.
[1458546547] Nagios 4.1.1 starting... (PID=627)
[1458546547] Local time is Mon Mar 21 07:49:07 GMT 2016
[1458546547] LOG VERSION: 2.0
[1458546547] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1458546547] qh: core query handler registered
[1458546547] nerd: Channel hostchecks registered successfully
[1458546547] nerd: Channel servicechecks registered successfully
[1458546547] nerd: Channel opathchecks registered successfully
[1458546547] nerd: Fully initialized and ready to rock!
[1458546547] wproc: Successfully registered manager as @wproc with query handler
[1458546547] wproc: Registry request: name=Core Worker 2590;pid=2590
[1458546547] wproc: Registry request: name=Core Worker 2592;pid=2592
[1458546547] wproc: Registry request: name=Core Worker 2593;pid=2593
[1458546547] wproc: Registry request: name=Core Worker 2591;pid=2591
[1458546547] mod_gearman: initialized version 1.4_nagios4 (libgearman 0.25)
[1458546547] Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' initialized successfully.
[1458546547] livestatus: Livestatus 1.2.7i3p2 by Mathias Kettner. Socket: '/usr/local/nagios/var/rw/live'
[1458546547] livestatus: Please visit us at http://mathias-kettner.de/
[1458546547] livestatus: Hint: please try out OMD - the Open Monitoring Distribution
[1458546547] livestatus: Please visit OMD at http://omdistro.org
[1458546547] livestatus: Finished initialization. Further log messages go to /usr/local/nagios/var/livestatus.log
[1458546547] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' initialized successfully.
[1458546548] TIMEPERIOD TRANSITION: 24x7;-1;1
[1458546548] TIMEPERIOD TRANSITION: 9am_only;-1;0
[1458546548] TIMEPERIOD TRANSITION: none;-1;0
[1458546548] TIMEPERIOD TRANSITION: once_a_day_at_10am;-1;0
[1458546548] TIMEPERIOD TRANSITION: once_a_day_at_730am;-1;0
[1458546548] TIMEPERIOD TRANSITION: weekdays_9_to_5;-1;0
[1458546548] TIMEPERIOD TRANSITION: weekdays_all_hours;-1;1
Then nothing after that until I restarted Nagios again.
I was also running "top" when it failed to restart:
Code: Select all
Tasks: 171 total, 1 running, 169 sleeping, 1 stopped, 0 zombie
Cpu(s): 0.2%us, 0.2%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12198380k total, 5481612k used, 6716768k free, 288152k buffers
Swap: 4095996k total, 121760k used, 3974236k free, 4242392k cached
I did wonder if mod_gearman was causing a problem, where it may be trying to pass check results to Nagios while it was doing a reload.
So I ran the following:
Code: Select all
# /etc/rc.d/init.d/gearmand stop ; /etc/rc.d/init.d/nagios reload ; /etc/rc.d/init.d/gearmand start
To see if stopping gearmand before reloading Nagios and then restarting gearmand would solve the problem.
What I found was that stopping gearmand appears to also stop Nagios. When Nagios tried to restart, the PID it had been running as was no longer running.
Code: Select all
Stopping gearmand: [ OK ]
Running configuration check...
Stopping nagios:/etc/rc.d/init.d/nagios: line 140: kill: (7030) - No such process
done.
Starting nagios: done.
Starting gearmand: [ OK ]
So I just ran:
Code: Select all
# /etc/rc.d/init.d/gearmand stop
And when I checked to see if Nagios was running, it wasn't.
Then it took me 4 attempts to start it again with:
Code: Select all
# /etc/rc.d/init.d/nagios start
Again, all the logs showed was exactly the same as I've already posted.
Very confused....
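As an aside for reading excerpts like the nagios.log tail above: the bracketed numbers are Unix epoch seconds. Here is a minimal sketch for turning them into readable UTC times, assuming GNU date (its -d @SECONDS form); the function name is made up for illustration.

```shell
#!/bin/sh
# Rewrite the leading [epoch] stamp of each nagios.log line as a UTC date.
# Lines that do not start with [digits] are passed through unchanged.
humanize_nagios_log() {
    while IFS= read -r line; do
        stamp=${line#\[}       # drop the leading "["
        stamp=${stamp%%]*}     # keep everything before the first "]"
        case "$line" in
            \[*]*)
                case "$stamp" in
                    ''|*[!0-9]*) printf '%s\n' "$line" ;;
                    *) printf '[%s]%s\n' \
                           "$(date -u -d "@$stamp" '+%Y-%m-%d %H:%M:%S')" \
                           "${line#*]}" ;;
                esac ;;
            *) printf '%s\n' "$line" ;;
        esac
    done
}
```

Fed the log on stdin (for example humanize_nagios_log < nagios.log), the lines stamped 1458546547 come out as 2016-03-21 07:49:07, which matches the "Local time is Mon Mar 21 07:49:07 GMT 2016" entry.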
Re: Debugging Nagios Failed Reload
I've seen a similar issue where setting result_workers in the gearman configs to greater than 1 caused it to segfault (but not all the time). Starting had issues as well. Can you try changing yours to 1 and see if that resolves the issue for you?
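As a sketch of that change: the setting sits in the mod_gearman NEB module config file, whose location depends on how mod_gearman was installed, so the path is passed in as an argument here. The function name is made up for illustration, and it assumes the option is written as a plain key=value line.

```shell
#!/bin/sh
# Set result_workers=1 in the given mod_gearman NEB config file,
# keeping a .bak copy of the original next to it.
set_result_workers_to_one() {
    conf=$1
    [ -w "$conf" ] || { echo "cannot write $conf" >&2; return 1; }
    cp -p "$conf" "$conf.bak" || return 1
    if grep -q '^[[:space:]]*result_workers[[:space:]]*=' "$conf"; then
        # Option already present: rewrite its value from the backup copy.
        sed 's/^[[:space:]]*result_workers[[:space:]]*=.*/result_workers=1/' \
            "$conf.bak" > "$conf"
    else
        # Option absent: append it.
        echo 'result_workers=1' >> "$conf"
    fi
}
```

Nagios needs a restart afterwards for the module to pick up the new value.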
-
delboy1966
- Posts: 98
- Joined: Thu Oct 22, 2015 5:26 am
Re: Debugging Nagios Failed Reload
I'll change it and see how it performs over the next couple of days and will report back.
Thanks
Tony
Re: Debugging Nagios Failed Reload
Sounds good - I'll leave this open and we will await your response.
Former Nagios Employee
-
delboy1966
- Posts: 98
- Joined: Thu Oct 22, 2015 5:26 am
Re: Debugging Nagios Failed Reload
Again another issue solved.
I have restarted/reloaded Nagios a number of times today and yesterday and no failures.
Seems that was the solution.
Top marks again
Thanks
Tony
Thread can be closed.