Debugging Nagios Failed Reload

delboy1966 · Post by **delboy1966** » Thu Mar 17, 2016 3:48 am

This has been bugging me for a few weeks, but I've not had time to investigate it until now.

After making changes to config files and reloading Nagios with the init script, Nagios will at times stop running. It also happens if I do a restart rather than a reload.
We can make a lot of changes during the day, so I may have to restart Nagios 8 or 9 times a day. Out of those 8 or 9 times, I get the issue about 5 times.
I do a reload and then a ps listing and find it's no longer running, and then have to issue a start at least 3 times before it's running again.
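The manual routine described above (reload, check the process list, then retry the start) could be scripted so that each failure is recorded. The following is only a rough sketch: the init script path is the one used later in this thread, while the process name `nagios`, the sleep intervals, and the limit of 5 attempts are assumptions. It does not fix the underlying problem; it only makes the failures easier to count.

```shell
#!/bin/sh
# Sketch: function names, sleep times and retry counts are hypothetical.

# Run a command up to max_tries times, pausing one second between attempts.
retry() {
    max_tries=$1; shift
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            echo "succeeded on attempt $try"
            return 0
        fi
        try=$((try + 1))
        sleep 1
    done
    return 1
}

# True if a process named exactly "nagios" exists (assumed process name).
nagios_running() {
    pgrep -x nagios >/dev/null 2>&1
}

# Start Nagios and confirm it is still alive a few seconds later.
start_and_check() {
    /etc/rc.d/init.d/nagios start && sleep 5 && nagios_running
}

# Reload, give Nagios time to settle, and fall back to repeated starts.
reload_or_restart() {
    /etc/rc.d/init.d/nagios reload
    sleep 10
    nagios_running && return 0
    echo "nagios not running after reload, trying to start it" >&2
    retry 5 start_and_check
}
```

Calling `reload_or_restart` as root in place of the plain reload prints which start attempt finally succeeded, which gives a rough failure count over a day.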

I've done what I can to try and debug it myself but can't find anything.
I have enabled all logging and debugging but see no errors in the logs.
I have disabled the 2 broker_modules I'm running, gearmand and livestatus, and it still happens.

I need to find a way of tracing what the nagios reload is doing and where it fails, if anyone has any suggestions.

Running
Nagios Core 4.1.1 built from source
CentOS release 6.7 (Final) x86_64
Gearmand
Livestatus

Thanks in advance.

Tony

delboy1966 · Post by **delboy1966** » Thu Mar 17, 2016 3:51 am

Just to add...
I've already tried running nagios with -uxd, to use pre-cached configs and skip circular path checks.

Tony

nozlaf · Post by **nozlaf** » Thu Mar 17, 2016 6:33 am

Are you manually updating your config, or using something like NagiosQL or NConf to manage it?

delboy1966 · Post by **delboy1966** » Thu Mar 17, 2016 8:49 am

I am manually updating the configs.

rkennedy · Post by **rkennedy** » Thu Mar 17, 2016 1:11 pm

How many hosts / services are you monitoring? What kind of resources do you have allocated to this machine? It sounds like you're hitting a throttle somewhere.

Can you post your nagios.log, and a tail of the syslog when this happens again for us to look at?

delboy1966 · Post by **delboy1966** » Mon Mar 21, 2016 3:17 am

I am monitoring:
567 Hosts
3781 Services
Which is pretty much in line with the numbers I've monitored previously at other companies.

The server has plenty of resources, as it's not doing any checks: mod_gearman is installed and I have 6 worker nodes with mod_gearman_worker running on them.
When I did the reload this morning, the Nagios process again stopped running. Here is the tail of the nagios.log file, which is exactly the same as the tail of /var/log/messages:

Code:

[1458546546] Caught SIGHUP, restarting...
[1458546546] Event broker module 'NERD' deinitialized successfully.
[1458546546] Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' deinitialized successfully.
[1458546547] livestatus: Socket thread has terminated
[1458546547] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' deinitialized successfully.
[1458546547] Nagios 4.1.1 starting... (PID=627)
[1458546547] Local time is Mon Mar 21 07:49:07 GMT 2016
[1458546547] LOG VERSION: 2.0
[1458546547] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1458546547] qh: core query handler registered
[1458546547] nerd: Channel hostchecks registered successfully
[1458546547] nerd: Channel servicechecks registered successfully
[1458546547] nerd: Channel opathchecks registered successfully
[1458546547] nerd: Fully initialized and ready to rock!
[1458546547] wproc: Successfully registered manager as @wproc with query handler
[1458546547] wproc: Registry request: name=Core Worker 2590;pid=2590
[1458546547] wproc: Registry request: name=Core Worker 2592;pid=2592
[1458546547] wproc: Registry request: name=Core Worker 2593;pid=2593
[1458546547] wproc: Registry request: name=Core Worker 2591;pid=2591
[1458546547] mod_gearman: initialized version 1.4_nagios4 (libgearman 0.25)
[1458546547] Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' initialized successfully.
[1458546547] livestatus: Livestatus 1.2.7i3p2 by Mathias Kettner. Socket: '/usr/local/nagios/var/rw/live'
[1458546547] livestatus: Please visit us at http://mathias-kettner.de/
[1458546547] livestatus: Hint: please try out OMD - the Open Monitoring Distribution
[1458546547] livestatus: Please visit OMD at http://omdistro.org
[1458546547] livestatus: Finished initialization. Further log messages go to /usr/local/nagios/var/livestatus.log
[1458546547] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' initialized successfully.
[1458546548] TIMEPERIOD TRANSITION: 24x7;-1;1
[1458546548] TIMEPERIOD TRANSITION: 9am_only;-1;0
[1458546548] TIMEPERIOD TRANSITION: none;-1;0
[1458546548] TIMEPERIOD TRANSITION: once_a_day_at_10am;-1;0
[1458546548] TIMEPERIOD TRANSITION: once_a_day_at_730am;-1;0
[1458546548] TIMEPERIOD TRANSITION: weekdays_9_to_5;-1;0
[1458546548] TIMEPERIOD TRANSITION: weekdays_all_hours;-1;1

Then nothing after that until I restarted Nagios again.
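The bracketed values in nagios.log are Unix epoch seconds, which makes it awkward to line the log up against /var/log/messages or a `top` capture. A small filter like the one below might help. It is a sketch that assumes GNU `date` (for the `-d @seconds` form); the function name is made up for this example.

```shell
#!/bin/sh
# Sketch: rewrite "[epoch] message" lines with a readable UTC timestamp.
# Assumes GNU date; the function name is hypothetical.
nagios_log_to_utc() {
    while IFS= read -r line; do
        case "$line" in
            \[[0-9]*\]*)
                ts=${line#\[}       # drop the leading "["
                ts=${ts%%\]*}       # keep only the text before the first "]"
                rest=${line#*\] }   # message text after "] "
                printf '[%s] %s\n' \
                    "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S')" "$rest"
                ;;
            *)
                # Pass through anything that does not start with "[digits]".
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

For example, `tail -n 40 /usr/local/nagios/var/nagios.log | nagios_log_to_utc` (log path assumed from the socket paths in the log above) would show the `Nagios 4.1.1 starting...` line at 2016-03-21 07:49:07, matching the "Local time" entry that Nagios itself logged.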
I was also running a "top" when it failed to restart:

Code:

Tasks: 171 total,   1 running, 169 sleeping,   1 stopped,   0 zombie
Cpu(s):  0.2%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12198380k total,  5481612k used,  6716768k free,   288152k buffers
Swap:  4095996k total,   121760k used,  3974236k free,  4242392k cached

I did wonder if mod_gearman was causing a problem, where it may be trying to pass check results to Nagios while Nagios was doing a reload.
So I ran the following:

Code:

# /etc/rc.d/init.d/gearmand stop ; /etc/rc.d/init.d/nagios reload ; /etc/rc.d/init.d/gearmand start

This was to see if stopping gearmand before reloading Nagios, and then restarting gearmand, would solve the problem.
What I found was that stopping gearmand also appeared to stop Nagios. When the init script then tried to stop Nagios, the PID it had been running as no longer existed.

Code:

Stopping gearmand:                                         [  OK  ]
Running configuration check...
Stopping nagios:/etc/rc.d/init.d/nagios: line 140: kill: (7030) - No such process
 done.
Starting nagios: done.
Starting gearmand:                                         [  OK  ]
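The `kill: (7030) - No such process` message suggests the init script was working from a PID file whose process had already gone. One way to confirm that before and after stopping gearmand is to test the recorded PID with `kill -0`, which sends no signal and only checks that the process exists. This is a sketch: the function name is hypothetical and the lock file path in the usage note is an assumption based on a default source install.

```shell
#!/bin/sh
# Sketch: report whether the PID recorded in a PID/lock file is alive.
# Run as root; kill -0 also fails if you lack permission to signal the PID.
pidfile_status() {
    pidfile=$1
    if [ ! -r "$pidfile" ]; then
        echo "missing"
        return 2
    fi
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        echo "running"
        return 0
    fi
    echo "stale"
    return 1
}
```

Running `pidfile_status /usr/local/nagios/var/nagios.lock` immediately before and after `/etc/rc.d/init.d/gearmand stop` would show whether Nagios really exits at the moment gearmand goes away.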

So I just ran:

Code:

# /etc/rc.d/init.d/gearmand stop

And when I checked to see if Nagios was running, it wasn't.
Then it took me 4 attempts to start it again with:

Code:

# /etc/rc.d/init.d/nagios start

Again, all the logs showed was exactly what I've already posted.

Very confused....

ssax · Post by **ssax** » Mon Mar 21, 2016 4:13 pm

I've seen a similar issue where setting the result_worker in the gearman configs to greater than 1 caused it to segfault (but not all the time). Starting had issues as well. Can you try changing yours to 1 and see if that resolves the issue for you?
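To see what the setting is currently set to, something like the following could be used. It is a sketch with assumptions: the option is taken to be spelled `result_workers` in the Mod-Gearman NEB module config, and the config path in the usage note is a guess that will vary by install. Check your own config file for the exact name and location.

```shell
#!/bin/sh
# Sketch: print the last uncommented result_workers value in a config file.
# The option name and the function name are assumptions, see the note above.
get_result_workers() {
    conf=$1
    sed -n 's/^[[:space:]]*result_workers[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" \
        | tail -n 1
}
```

For example, `get_result_workers /etc/mod_gearman/mod_gearman_neb.conf` (path assumed) would print the current value. If it prints anything greater than 1, edit the file to set it to 1 and restart Nagios so the module is reloaded.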

delboy1966 · Post by **delboy1966** » Wed Mar 23, 2016 3:33 am

I'll change it and see how it performs over the next couple of days and will report back.

Thanks
Tony

rkennedy · Post by **rkennedy** » Wed Mar 23, 2016 9:22 am

Sounds good - I'll leave this open and we will await your response.

delboy1966 · Post by **delboy1966** » Thu Mar 24, 2016 8:26 am

Again another issue solved.
I have restarted/reloaded Nagios a number of times today and yesterday and no failures.
Seems that was the solution.

Top marks again

Thanks
Tony

Thread can be closed.

Nagios Support Forum