Debugging Nagios Failed Reload

delboy1966
Posts: 98
Joined: Thu Oct 22, 2015 5:26 am

Post by delboy1966 »

This has been bugging me for a few weeks, but I've not had time to investigate it until now.

After changing config files and reloading Nagios with the init script, Nagios sometimes stops running. It also happens if I do a restart instead of a reload.
We can make a lot of changes during the day, so I may have to restart Nagios 8 or 9 times a day, and about 5 of those times I hit the issue.
I do a reload, then a ps listing, and find it's no longer running. I then have to issue a start at least 3 times before it's running again.

I've done what I can to try and debug it myself but can't find anything.
I have enabled all logging and debugging but see no errors in the logs.
I have disabled the 2 broker_modules I'm running, mod_gearman and livestatus, and it still happens.

I need a way to trace what the Nagios reload is doing and where it fails, if anyone has any suggestions.
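
One possible starting point, as a rough sketch rather than a tested procedure: check whether the daemon's PID is still alive a few seconds after each reload and log a timestamp when it is not, so failures can be lined up against nagios.log. The helper name `pid_alive` and the /tmp log path are made up for illustration, and the default lock file path is an assumption (use the `lock_file` value from your nagios.cfg).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the PID stored in a lock file is still alive.
# The default path is an assumption; use the lock_file value from nagios.cfg.
# Run it as root or as the nagios user, otherwise kill -0 may be refused.
pid_alive() {
    lock_file="${1:-/usr/local/nagios/var/nagios.lock}"
    [ -r "$lock_file" ] || return 1
    pid=$(head -n 1 "$lock_file" | tr -d '[:space:]')
    [ -n "$pid" ] || return 1
    kill -0 "$pid" 2>/dev/null   # signal 0 only tests that the process exists
}

# Example use after a reload (left commented out so nothing runs by accident):
# /etc/rc.d/init.d/nagios reload
# sleep 10
# pid_alive || echo "$(date '+%s') nagios not running after reload" >> /tmp/nagios-reload.log
```

The epoch timestamp from `date '+%s'` matches the format of the bracketed timestamps in nagios.log, which should make it easier to find the matching log lines.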

Running
Nagios Core 4.1.1 built from source
CentOS release 6.7 (Final) x86_64
Gearmand
Livestatus

Thanks in advance.

Tony

Post by delboy1966 »

Just to add...
I've already tried running nagios with -uxd, to skip the circular path checks and to use the pre-cached config.

Tony
nozlaf
Posts: 172
Joined: Sun Nov 09, 2014 9:50 pm
Location: Victoria, Australia

Post by nozlaf »

Are you manually updating your config, or using something like NagiosQL or NConf to manage it?

Post by delboy1966 »

I am manually updating the configs.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Post by rkennedy »

How many hosts / services are you monitoring? What kind of resources do you have allocated to this machine? It sounds like you're hitting a throttle somewhere.

Can you post your nagios.log, and a tail of the syslog when this happens again for us to look at?
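
A minimal sketch of one way to gather both tails in a single paste-able report right after a failed reload. The helper name `collect_tails` and the report path are made up for illustration, and the nagios.log path is an assumption (check `log_file` in nagios.cfg).

```shell
#!/bin/sh
# Hypothetical helper: print the last N lines of each readable log file,
# with a header per file, so the output can be pasted into a forum post.
collect_tails() {
    lines="$1"
    shift
    for f in "$@"; do
        echo "==== $f (last $lines lines) ===="
        if [ -r "$f" ]; then
            tail -n "$lines" "$f"
        else
            echo "(not readable)"
        fi
    done
}

# Example (paths are assumptions; check log_file in nagios.cfg):
# collect_tails 50 /usr/local/nagios/var/nagios.log /var/log/messages > /tmp/nagios-reload-report.txt
```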
Former Nagios Employee

Post by delboy1966 »

I am monitoring:
567 Hosts
3781 Services
Which is pretty much in line with the numbers I've monitored previously at other companies.

The server has plenty of resources, as it's not running any checks itself: mod_gearman is installed and I have 6 worker nodes with mod_gearman_worker running on them.
When I did the reload this morning, the Nagios process stopped running again. Here is the tail of the nagios.log file, which is exactly the same as the tail of /var/log/messages:

```
[1458546546] Caught SIGHUP, restarting...
[1458546546] Event broker module 'NERD' deinitialized successfully.
[1458546546] Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' deinitialized successfully.
[1458546547] livestatus: Socket thread has terminated
[1458546547] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' deinitialized successfully.
[1458546547] Nagios 4.1.1 starting... (PID=627)
[1458546547] Local time is Mon Mar 21 07:49:07 GMT 2016
[1458546547] LOG VERSION: 2.0
[1458546547] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1458546547] qh: core query handler registered
[1458546547] nerd: Channel hostchecks registered successfully
[1458546547] nerd: Channel servicechecks registered successfully
[1458546547] nerd: Channel opathchecks registered successfully
[1458546547] nerd: Fully initialized and ready to rock!
[1458546547] wproc: Successfully registered manager as @wproc with query handler
[1458546547] wproc: Registry request: name=Core Worker 2590;pid=2590
[1458546547] wproc: Registry request: name=Core Worker 2592;pid=2592
[1458546547] wproc: Registry request: name=Core Worker 2593;pid=2593
[1458546547] wproc: Registry request: name=Core Worker 2591;pid=2591
[1458546547] mod_gearman: initialized version 1.4_nagios4 (libgearman 0.25)
[1458546547] Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' initialized successfully.
[1458546547] livestatus: Livestatus 1.2.7i3p2 by Mathias Kettner. Socket: '/usr/local/nagios/var/rw/live'
[1458546547] livestatus: Please visit us at http://mathias-kettner.de/
[1458546547] livestatus: Hint: please try out OMD - the Open Monitoring Distribution
[1458546547] livestatus: Please visit OMD at http://omdistro.org
[1458546547] livestatus: Finished initialization. Further log messages go to /usr/local/nagios/var/livestatus.log
[1458546547] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' initialized successfully.
[1458546548] TIMEPERIOD TRANSITION: 24x7;-1;1
[1458546548] TIMEPERIOD TRANSITION: 9am_only;-1;0
[1458546548] TIMEPERIOD TRANSITION: none;-1;0
[1458546548] TIMEPERIOD TRANSITION: once_a_day_at_10am;-1;0
[1458546548] TIMEPERIOD TRANSITION: once_a_day_at_730am;-1;0
[1458546548] TIMEPERIOD TRANSITION: weekdays_9_to_5;-1;0
[1458546548] TIMEPERIOD TRANSITION: weekdays_all_hours;-1;1
```
Then nothing after that until I restarted Nagios again.
I was also running a "top" when it failed to restart:

```
Tasks: 171 total,   1 running, 169 sleeping,   1 stopped,   0 zombie
Cpu(s):  0.2%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12198380k total,  5481612k used,  6716768k free,   288152k buffers
Swap:  4095996k total,   121760k used,  3974236k free,  4242392k cached
```
I did wonder if mod_gearman was causing a problem, where it may be trying to pass check results to Nagios while it was doing a reload.
So I ran the following:

```
# /etc/rc.d/init.d/gearmand stop ; /etc/rc.d/init.d/nagios reload ; /etc/rc.d/init.d/gearmand start
```


This was to see whether stopping gearmand before reloading Nagios, and then restarting gearmand, would solve the problem.
What I found was that stopping gearmand appears to also stop Nagios. When the reload ran, the PID Nagios had been running as no longer existed.

```
Stopping gearmand:                                         [  OK  ]
Running configuration check...
Stopping nagios:/etc/rc.d/init.d/nagios: line 140: kill: (7030) - No such process
 done.
Starting nagios: done.
Starting gearmand:                                         [  OK  ]
```
So I just ran:

```
# /etc/rc.d/init.d/gearmand stop
```


And when I checked to see if Nagios was running it wasn't.
Then it took me 4 attempts to start it again with:

```
# /etc/rc.d/init.d/nagios start
```
Again, all the logs showed was exactly the same as what I've already posted.

Very confused....
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Post by ssax »

I've seen a similar issue where setting result_workers in the gearman configs to greater than 1 caused it to segfault (but not all the time). Starting had issues as well. Can you try changing yours to 1 and see if that resolves the issue for you?
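
As a quick way to see what the setting is currently, here is a minimal sketch that reads a key=value option out of a config file. The helper name `get_setting` is made up for illustration, the option spelling `result_workers` is how I believe mod_gearman names it, and the file path is an assumption (use whatever file your broker_module line passes as its config).

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key=value setting from a config file.
# Commented-out lines are ignored; if the key appears twice, the last one is printed.
get_setting() {
    key="$1"
    file="$2"
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Example (the path is an assumption; check the config= argument on the
# broker_module line in nagios.cfg):
# get_setting result_workers /etc/mod_gearman/mod_gearman_neb.conf
```

If that prints anything other than 1, edit the file so the line reads result_workers=1 and restart Nagios so the module picks up the change.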

Post by delboy1966 »

I'll change it and see how it performs over the next couple of days and will report back.

Thanks
Tony

Post by rkennedy »

Sounds good - I'll leave this open and we will await your response.

Post by delboy1966 »

Again another issue solved.
I have restarted/reloaded Nagios a number of times today and yesterday and no failures.
Seems that was the solution.

Top marks again

Thanks
Tony

Thread can be closed.