Nagios 4.3.1 crashes when using mod_gearman

danjoh

Post by danjoh »

I have upgraded Nagios from 4.2.4 to 4.3.1 (luckily only on my development box) and now it crashes with a SIGSEGV / SIGTERM repeatedly (about once a minute).
To me it looks like a problem that occurs when a broker module sends data "back" to Nagios.

I base this on the following facts.

If I disable mod_gearman in nagios.cfg, everything works OK.
If I enable mod_gearman in nagios.cfg, but do not use it for host-/service-checks, everything works OK.
If I enable mod_gearman and use it for host-/service-checks it starts crashing.

Sadly, the only things I can see in the Nagios log are:
Caught SIGSEGV, shutting down...
Caught SIGTERM, shutting down...

In the debug-log I do not see anything strange.
Here are my SW releases:
OS: RHEL 7.3
Nagios-Core 4.3.1 (built from source)
mod_gearman 3.0.1-1 (labs.consol.de)
gearmand 0.33-5 (labs.consol.de)

Running nagios under gdb I see the following when it crashes:

Program received signal SIGSEGV, Segmentation fault.
clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
2851 my_free(this_customvariablesmember->variable_name);
Missing separate debuginfos, use: debuginfo-install boost-system-1.53.0-26.el7.x86_64 gearmand-0.33-5.x86_64 glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 sssd-client-1.14.0-43.el7_3.11.x86_64
(gdb) bt
#0 clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
#1 0x00005555555916bc in clear_contact_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:3001
#2 0x00005555555918b7 in clear_volatile_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:2870
#3 0x00007ffff64aaa9e in handle_svc_check (event_type=<optimized out>, data=0x7fffffffda30) at neb_module_nagios4/../neb_module/mod_gearman.c:851
#4 0x000055555556bb2f in neb_make_callbacks (callback_type=callback_type@entry=6, data=data@entry=0x7fffffffda30) at nebmods.c:529
#5 0x0000555555569f10 in broker_service_check (type=type@entry=704, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x555555e97310, check_type=check_type@entry=0,
    start_time=..., end_time=..., cmd=<optimized out>, latency=0, exectime=exectime@entry=0, timeout=timeout@entry=0, early_timeout=early_timeout@entry=0,
    retcode=retcode@entry=0, cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x0) at broker.c:326
#6 0x000055555557172f in run_async_service_check (svc=svc@entry=0x555555e97310, check_options=check_options@entry=0, latency=latency@entry=0.0008800000068731606,
    scheduled_check=scheduled_check@entry=1, reschedule_check=reschedule_check@entry=1, time_is_valid=time_is_valid@entry=0x7fffffffe29c,
    preferred_time=preferred_time@entry=0x7fffffffe2a8) at checks.c:199
#7 0x0000555555571cb1 in run_scheduled_service_check (svc=svc@entry=0x555555e97310, check_options=0, latency=latency@entry=0.0008800000068731606) at checks.c:90
#8 0x0000555555587adb in handle_timed_event (event=event@entry=0x555555e8fc20) at events.c:1171
#9 0x0000555555588623 in event_execution_loop () at events.c:1110
#10 0x0000555555568a56 in main (argc=<optimized out>, argv=<optimized out>) at nagios.c:814

I have opened an issue on GitHub for mod_gearman (https://github.com/sni/mod_gearman/issues/110), and the response there was:
This looks like it may be a Core bug. I was able to replicate with pre-built and compiled from source ModGearman modules.
Has anything changed with the NEB interface between 4.2.4 and 4.3.1?
Any suggestions on what could be wrong? (I know you do not "support" third-party modules, but let us not start finger-pointing, please.)
If you need more debugging info, I would be glad to help.

Regards,
--
D/\N
tmcdonald

Post by tmcdonald »

For reference, the gentleman you quoted does in fact work for Nagios so I'll speak with him directly about this and get back to you when I hear more.

Update: Our Core dev is aware of the issue and they're working on getting it resolved.
Former Nagios employee
bheden
Product Development Manager

Post by bheden »

Based on my initial investigation, it isn't anything NEB specific. It looks like some change in the macro freeing code has occurred causing this issue. The Nagios Core developer is aware of the bug, and I promise I'll post an update to that ModGearman issue whenever he lets me know. ;)
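
For readers following along, here is a minimal C sketch of the kind of failure the backtrace points at. It is not the Nagios Core source, and the root cause is an assumption, not a confirmed diagnosis: the crash sits in clear_custom_vars() while freeing a variable name, which is what one would expect if the list head inside a macro struct held a stale or uninitialized pointer when the clearing code walked it. The types `customvar` and `macros` and the helpers `dup_string`, `add_custom_var`, and `init_macros` are hypothetical, simplified stand-ins.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for Nagios Core's customvariablesmember and
 * nagios_macros; the real structs carry many more fields. */
typedef struct customvar {
    char *variable_name;
    char *variable_value;
    struct customvar *next;
} customvar;

typedef struct macros {
    customvar *custom_contact_vars;
} macros;

/* Hypothetical helper: heap-allocate a copy of a string. */
char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Hypothetical helper: push one name/value pair onto the list.
 * Returns 0 on success, -1 on allocation failure. */
int add_custom_var(customvar **vars, const char *name, const char *value)
{
    customvar *node = malloc(sizeof(*node));
    if (node == NULL)
        return -1;
    node->variable_name = dup_string(name);
    node->variable_value = dup_string(value);
    if (node->variable_name == NULL || node->variable_value == NULL) {
        free(node->variable_name);
        free(node->variable_value);
        free(node);
        return -1;
    }
    node->next = *vars;
    *vars = node;
    return 0;
}

/* Walk the list, free every node, and reset the head to NULL.
 * This is only safe if *vars is NULL or points at a valid heap node:
 * a garbage head pointer would be dereferenced in the loop body.
 * Returns the number of nodes freed. */
int clear_custom_vars(customvar **vars)
{
    int freed = 0;
    customvar *cur;

    assert(vars != NULL);
    cur = *vars;
    while (cur != NULL) {
        customvar *next = cur->next;
        free(cur->variable_name);
        free(cur->variable_value);
        free(cur);
        cur = next;
        freed++;
    }
    *vars = NULL;
    return freed;
}

/* Assumed guard: zero the struct before its first use so that every
 * list head starts out as NULL. */
void init_macros(macros *mac)
{
    memset(mac, 0, sizeof(*mac));
}
```

With a zeroed struct, clear_custom_vars(&mac.custom_contact_vars) is a harmless no-op. Without the init_macros() call, a stack-allocated struct could contain whatever bytes were left there, and the first free() might fault in much the same way as frame #0 above. Whether that is what actually happens between Core 4.3.1 and mod_gearman is for the developers to confirm.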

Nagios Enterprises
Senior Developer
danjoh

Post by danjoh »

Thanks for the update/feedback.

Keep up the great work.
--
D/\N
dwhitfield
Former Nagios Staff

Post by dwhitfield »

Just spoke to the dev about this and he suggested checking back next week. Probably Thursday is a good day next week because Friday is a short day for support.
superman

Post by superman »

Hi danjoh,

Any reason you upgraded from 4.2.4 to 4.3.1? Are there any major bugs or problems in 4.2.4? From the 4.3.1 version history, I see there are major fixes.
FYI, we are planning to upgrade to 4.2.4 and our major concern is stability.


4.3.1 – 02/23/2017

FIXES
Backed out the changes related to the “login page” and “logoff link” from 4.3.0, since it was causing problems
Service hard state generation and host hard or soft down status
Comments are duplicated through Nagios reload
host hourly value is incorrectly dumped as json boolean
Bug – Quick Search no longer allows search by IP
Config: status_update_interval can not be set to 1
Check attempts not increasing if nagios is reloaded
nagios hangs on reload while sending external command to cmd file
Feature Request: return code xxx out of bounds – include message as well
danjoh

Post by danjoh »

Hi Superman,

The main reasons to update were (among other things):

* Bug - Quick Search no longer allows search by IP (4.3.1)
* Fix for CVE-2016-6209 - The "corewindow" parameter (4.3.0)
* Added a login page and a `Logoff` link (4.3.0)
* On the status map, the host name will be colored if services are not all OK. (4.3.0)
* Added "Clear flapping state" command on host and services detail pages (4.3.0)

And of course to test the latest and greatest ;-)
As long as mod_gearman integration (or remote workers) does not work, I will stay on 4.2.4 on our production host.
--
D/\N
dwhitfield
Former Nagios Staff

Post by dwhitfield »

@superman, we had a lot of problems with the new features in 4.3.0, but 4.3.1 has been really good so far (released just two days after 4.3.0). This mod_gearman issue is the only one I know about in 4.3.1, although I'm sure there are still some open bug reports on the github.
superman

Post by superman »

Thanks dwhitfield & danjoh.

We will stay with the current version until the mod_gearman issue is solved, as it is very useful for distributed checks.
mcapra

Post by mcapra »

@danjoh did you have additional questions regarding this? Otherwise, we'll touch base once the issue has been more thoroughly investigated from the Nagios Core side of things.
Former Nagios employee
https://www.mcapra.com/