
Nagios 4.3.1 crashes when using mod_gearman

Posted: Wed Mar 15, 2017 3:04 am
by danjoh
I have upgraded Nagios from 4.2.4 to 4.3.1 (luckily only on my development box) and now it crashes with a SIGSEGV / SIGTERM repeatedly (about once a minute).
To me it looks like a problem that occurs when a broker module sends data "back" to Nagios.

I base this on the following facts.

If I disable mod_gearman in nagios.cfg, everything works OK.
If I enable mod_gearman in nagios.cfg, but do not use it for host-/service-checks, everything works OK.
If I enable mod_gearman and use it for host-/service-checks it starts crashing.
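
As a rough illustration, the three cases above map to settings along these lines. This is a sketch only: the module path, the config file path, and the file name module.conf are assumptions based on a typical package install, not values taken from the post.

```
# nagios.cfg
# Case 1: comment this line out. Cases 2 and 3: leave it active.
# Both paths below are assumed, adjust them to your install.
broker_module=/usr/lib64/mod_gearman/mod_gearman_nagios4.o config=/etc/mod_gearman/module.conf

# module.conf (assumed file name)
# Case 2: module loaded, but checks still run locally.
hosts=no
services=no
# Case 3: set both of the options above to yes so checks go to gearman workers.
```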

Sadly, the only things I can see in the Nagios log are:
Caught SIGSEGV, shutting down...
Caught SIGTERM, shutting down...

In the debug-log I do not see anything strange.
Here are my SW releases:
OS: RHEL 7.3
Nagios-Core 4.3.1 (built from source)
mod_gearman 3.0.1-1 (labs.consol.de)
gearmand 0.33-5 (labs.consol.de)

Running nagios under gdb I see the following when it crashes:

Program received signal SIGSEGV, Segmentation fault.
clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
2851 my_free(this_customvariablesmember->variable_name);
Missing separate debuginfos, use: debuginfo-install boost-system-1.53.0-26.el7.x86_64 gearmand-0.33-5.x86_64 glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 sssd-client-1.14.0-43.el7_3.11.x86_64
(gdb) bt
#0 clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
#1 0x00005555555916bc in clear_contact_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:3001
#2 0x00005555555918b7 in clear_volatile_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:2870
#3 0x00007ffff64aaa9e in handle_svc_check (event_type=<optimized out>, data=0x7fffffffda30) at neb_module_nagios4/../neb_module/mod_gearman.c:851
#4 0x000055555556bb2f in neb_make_callbacks (callback_type=callback_type@entry=6, data=data@entry=0x7fffffffda30) at nebmods.c:529
#5 0x0000555555569f10 in broker_service_check (type=type@entry=704, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x555555e97310, check_type=check_type@entry=0,
start_time=..., end_time=..., cmd=<optimized out>, latency=0, exectime=exectime@entry=0, timeout=timeout@entry=0, early_timeout=early_timeout@entry=0,
retcode=retcode@entry=0, cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x0) at broker.c:326
#6 0x000055555557172f in run_async_service_check (svc=svc@entry=0x555555e97310, check_options=check_options@entry=0, latency=latency@entry=0.0008800000068731606,
scheduled_check=scheduled_check@entry=1, reschedule_check=reschedule_check@entry=1, time_is_valid=time_is_valid@entry=0x7fffffffe29c,
preferred_time=preferred_time@entry=0x7fffffffe2a8) at checks.c:199
#7 0x0000555555571cb1 in run_scheduled_service_check (svc=svc@entry=0x555555e97310, check_options=0, latency=latency@entry=0.0008800000068731606) at checks.c:90
#8 0x0000555555587adb in handle_timed_event (event=event@entry=0x555555e8fc20) at events.c:1171
#9 0x0000555555588623 in event_execution_loop () at events.c:1110
#10 0x0000555555568a56 in main (argc=<optimized out>, argv=<optimized out>) at nagios.c:814

I have opened an issue on GitHub for mod_gearman (https://github.com/sni/mod_gearman/issues/110), and this is what they say:
This looks like it may be a Core bug. I was able to replicate with pre-built and compiled from source ModGearman modules.
Has anything changed with the NEB interface between 4.2.4 and 4.3.1?
Any suggestions on what could be wrong? (I know you do not "support" third-party modules, but let us not start finger-pointing, please.)
If you need more debugging info, I would be glad to help.

Regards,

Re: Nagios 4.3.1 crashes when using mod_gearman

Posted: Wed Mar 15, 2017 12:24 pm
by tmcdonald
For reference, the gentleman you quoted does in fact work for Nagios so I'll speak with him directly about this and get back to you when I hear more.

Update: Our Core dev is aware of the issue and they're working on getting it resolved.

Re: Nagios 4.3.1 crashes when using mod_gearman

Posted: Wed Mar 15, 2017 1:26 pm
by bheden
Based on my initial investigation, it isn't anything NEB specific. It looks like a change in the macro-freeing code is causing this issue. The Nagios Core developer is aware of the bug, and I promise I'll post an update to that ModGearman issue whenever he lets me know. ;)
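
For readers unfamiliar with this class of bug, here is a minimal sketch of a list-freeing routine similar in shape to the one at the top of the backtrace. It is not the Nagios source: the names `custom_var`, `macro_set`, `add_var` and `clear_vars` are hypothetical, and the thread does not confirm that this is the actual cause. It only shows why such a routine is safe when the list head starts out as NULL and is reset after freeing, and why a head pointer that was never initialized (for example in a struct on the stack that was not zeroed) would make the same loop dereference garbage.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a custom-variable list node (not the Nagios struct). */
struct custom_var {
    char *name;
    char *value;
    struct custom_var *next;
};

/* Hypothetical stand-in for a per-check macro container. */
struct macro_set {
    struct custom_var *contact_vars;
};

/* Return a heap copy of s, or NULL if allocation fails. */
static char *dup_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Prepend a variable to the list. Returns 0 on success, -1 on allocation failure. */
int add_var(struct custom_var **head, const char *name, const char *value) {
    struct custom_var *node = calloc(1, sizeof(*node));
    if (node == NULL)
        return -1;
    node->name = dup_string(name);
    node->value = dup_string(value);
    if (node->name == NULL || node->value == NULL) {
        free(node->name);
        free(node->value);
        free(node);
        return -1;
    }
    node->next = *head;
    *head = node;
    return 0;
}

/* Free every node and reset the head to NULL. Returns the number of nodes freed.
 * This is only safe if *head is either NULL or a valid node: an uninitialized
 * head would be followed as if it were a real pointer. */
int clear_vars(struct custom_var **head) {
    int freed = 0;
    struct custom_var *cur = *head;
    while (cur != NULL) {
        struct custom_var *next = cur->next;
        free(cur->name);
        free(cur->value);
        free(cur);
        cur = next;
        freed++;
    }
    *head = NULL; /* makes a second clear a harmless no-op */
    return freed;
}
```

A caller that keeps a `struct macro_set` on the stack would zero it first (for example with `memset(&mac, 0, sizeof(mac))`) before ever passing `&mac.contact_vars` to `clear_vars`. After that, clearing twice in a row is harmless because the second call sees a NULL head.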

Re: Nagios 4.3.1 crashes when using mod_gearman

Posted: Thu Mar 16, 2017 2:13 am
by danjoh
Thanks for the update/feedback.

Keep up the great work.

Re: Nagios 4.3.1 crashes when using mod_gearman

Posted: Thu Mar 16, 2017 3:27 pm
by dwhitfield
Just spoke to the dev about this and he suggested checking back next week. Thursday is probably the best day, since Friday is a short day for support.

Re: Nagios 4.3.1 crashes when using mod_gearman

Posted: Wed Mar 22, 2017 4:10 am
by superman
Hi danjoh,

Any reason you upgraded from 4.2.4 to 4.3.1? Are there any major bugs or problems in 4.2.4? From the 4.3.1 version history, I see there are major fixes.
FYI, we are planning to upgrade to 4.2.4 and our major concern is stability.


4.3.1 – 02/23/2017

FIXES
Backed out the changes related to the “login page” and “logoff link” from 4.3.0, since it was causing problems
Service hard state generation and host hard or soft down status
Comments are duplicated through Nagios reload
host hourly value is incorrectly dumped as json boolean
Bug – Quick Search no longer allows search by IP
Config: status_update_interval can not be set to 1
Check attempts not increasing if nagios is reloaded
nagios hangs on reload while sending external command to cmd file
Feature Request: return code xxx out of bounds – include message as well

Re: Nagios 4.3.1 crashes when using mod_gearman

Posted: Wed Mar 22, 2017 10:05 am
by danjoh
Hi Superman,

The main reasons to update were (among other things):

* Bug - Quick Search no longer allows search by IP (4.3.1)
* Fix for CVE-2016-6209 - The "corewindow" parameter (4.3.0)
* Added a login page and a `Logoff` link (4.3.0)
* On the status map, the host name will be colored if services are not all OK. (4.3.0)
* Added "Clear flapping state" command on host and services detail pages (4.3.0)

And of course to test the latest and greatest ;-)
As long as mod_gearman integration (or remote workers) does not work, I will stay on 4.2.4 on our production host.

Re: Nagios 4.3.1 crashes when using mod_gearman

Posted: Wed Mar 22, 2017 10:39 am
by dwhitfield
@superman, we had a lot of problems with the new features in 4.3.0, but 4.3.1 has been really good so far (released just two days after 4.3.0). This mod_gearman issue is the only one I know about in 4.3.1, although I'm sure there are still some open bug reports on GitHub.

Re: Nagios 4.3.1 crashes when using mod_gearman

Posted: Fri Mar 24, 2017 1:38 am
by superman
Thanks dwhitfield & danjoh.

We will stay with the current version until the mod_gearman issue is solved, as it is very useful for distributed checks.

Re: Nagios 4.3.1 crashes when using mod_gearman

Posted: Fri Mar 24, 2017 12:59 pm
by mcapra
@danjoh did you have additional questions regarding this? Otherwise, we'll touch base once the issue has been more thoroughly investigated from the Nagios Core side of things.