Strange issue with Nagios stopping

Postby delboy1966 » Fri Sep 08, 2017 5:12 am

Over the past few weeks we have experienced times when Nagios just stops running.
Can't tie it down to anything, nothing on the Nagios server has been updated or installed.
The only thing that we have done is add new hosts and service checks.

When it stops, it can take a few attempts to get it back up and running.
We start it and it runs for maybe a minute or so and then dies again.
There is nothing in the log file to indicate why it has stopped.

Has anyone ever seen this?
Is there any logging I can put in place to see why it does it?


Specs are:
Nagios 4.3.2 core
gearmand-0.25-1
Nagvis 1.8.5
Livestatus 1.2.7i3p2

Running on a VM, which it has been running on for over 2 years now.
64G RAM
6 x CPU

I have checked the config files using the -v switch with nagios and there are no errors and no warnings.

###
Running pre-flight check on configuration data...

Checking objects...
Checked 18167 services.
Checked 2554 hosts.
Checked 359 host groups.
Checked 63 service groups.
Checked 140 contacts.
Checked 40 contact groups.
Checked 345 commands.
Checked 33 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 2554 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 33 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors: 0
###


Nagios config file:
##########
log_file=/usr/local/nagios/var/nagios.log
cfg_dir=/usr/local/nagios/etc/objects
object_cache_file=/usr/local/nagios/var/objects.cache
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
status_file=/usr/local/nagios/var/ramdisk/status.dat
status_update_interval=30
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_file=/usr/local/nagios/var/rw/nagios.cmd
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/usr/local/nagios/var/ramdisk/tmp
event_broker_options=-1
log_rotation_method=d
log_archive_path=/usr/local/nagios/var/archives/naglogs
use_syslog=0
log_notifications=1
log_service_retries=0
log_host_retries=0
log_event_handlers=0
log_initial_states=0
log_current_states=0
log_external_commands=0
log_passive_checks=0
global_host_event_handler=log_host_state_changes
service_inter_check_delay_method=s
max_service_check_spread=40
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=40
max_concurrent_checks=0
check_result_reaper_frequency=10
max_check_result_reaper_time=30
check_result_path=/usr/local/nagios/var/ramdisk/spool/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=15
cached_service_check_horizon=15
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
soft_state_dependencies=0
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
service_check_timeout=90
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=0
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
interval_length=60
check_for_updates=0
bare_update_check=0
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=1
obsess_over_services=1
ocsp_command=global_service_event
obsess_over_hosts=0
translate_passive_host_checks=0
passive_host_checks_are_soft=0
check_for_orphaned_services=1
check_for_orphaned_hosts=1
check_service_freshness=0
service_freshness_check_interval=60
service_check_timeout_state=u
check_host_freshness=0
host_freshness_check_interval=60
additional_freshness_latency=15
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=euro
illegal_object_name_chars=`~!$%^*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios@localhost
admin_pager=pagenagios@localhost
daemon_dumps_core=0
use_large_installation_tweaks=1
enable_environment_macros=1
debug_level=0
debug_verbosity=0
debug_file=/usr/local/nagios/var/nagios.debug
max_debug_file_size=1000000
allow_empty_hostgroup_assignment=0
host_down_disable_service_checks=1
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live log_file=/usr/local/nagios/var/archives/naglogs/livestatus.log
broker_module=/usr/lib64/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf

##############


Thanks in advance.

Tony
delboy1966
 
Posts: 53
Joined: Thu Oct 22, 2015 5:26 am

Re: Strange issue with Nagios stopping

Postby scottwilkerson » Fri Sep 08, 2017 1:17 pm

Was mod_gearman compiled using the nagios4 flag?

Is it possible to temporarily disable the mod_gearman and livestatus addons to make sure they aren't causing the service to stop?

Finally, can we verify that there is only one parent process?
ps -ef | grep bin/nagios
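
A minimal sketch of how that check could be scripted, assuming the usual Linux `ps -ef` column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD) and a daemon whose parent is PID 1. The helper name `count_nagios_parents` is made up for illustration and is not part of Nagios.

```shell
# Count nagios daemon processes whose parent is PID 1 (top-level daemons).
# Reads `ps -ef` style output on stdin: UID PID PPID C STIME TTY TIME CMD...
count_nagios_parents() {
    awk '$3 == 1 && $8 ~ /bin\/nagios$/ { n++ } END { print n + 0 }'
}

# Usage (not run here):
#   ps -ef | count_nagios_parents
```

A result above 1 would suggest a stale daemon left over from an earlier start or reload. Worker processes and the daemon's own child are not counted because their parent is the main daemon, not PID 1.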
scottwilkerson
CTO
 
Posts: 7620
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Strange issue with Nagios stopping

Postby delboy1966 » Sun Sep 10, 2017 10:40 am

Hi

Thanks for the reply.

I installed mod_gearman from RPM mod_gearman-1.4_nagios4-1.el6.x86_64

Currently if I disable livestatus all our maps will be affected which isn't ideal. However I am in the process of getting a duplicate Nagios box provisioned which will be identical to the current live one but on different IP addresses. So I will be able to play about with that, but that will now be in 2 weeks when I return to work after a break.

The output of the ps command is:

nagios 7072 1 18 Sep09 ? 07:26:25 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 7074 7072 0 Sep09 ? 00:02:45 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7075 7072 0 Sep09 ? 00:02:44 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7076 7072 0 Sep09 ? 00:02:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7077 7072 0 Sep09 ? 00:02:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7078 7072 0 Sep09 ? 00:02:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7079 7072 0 Sep09 ? 00:02:44 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7080 7072 0 Sep09 ? 00:02:44 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7081 7072 0 Sep09 ? 00:02:45 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7082 7072 0 Sep09 ? 00:02:45 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7083 7072 0 Sep09 ? 00:00:08 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


Tony

Re: Strange issue with Nagios stopping

Postby delboy1966 » Mon Sep 11, 2017 7:06 am

As an update.

I am told that this seems to happen when Nagios has been reloaded after changes have been made.
After a reload, Nagios sometimes runs for a couple of minutes and then stops, and sometimes quits after about 10 seconds.

Re: Strange issue with Nagios stopping

Postby tgriep » Mon Sep 11, 2017 4:39 pm

Do you see any errors in the messages log file?
Take a look at this file to see if there are any errors:
/var/log/messages


Try disabling livestatus or gearman to see if Nagios runs longer. If it does run longer, you know where to look next.
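
A rough sketch of how the messages file could be narrowed down, assuming the kernel logs there (as on a stock RHEL/CentOS box). The search patterns are common kernel message fragments for segfaults and the OOM killer and may differ between kernel versions. The function name `find_nagios_kills` is hypothetical.

```shell
# Print syslog lines that mention nagios together with a crash or OOM hint.
# $1: path to the syslog file (e.g. /var/log/messages)
find_nagios_kills() {
    grep -i 'nagios' "$1" | grep -iE 'segfault|out of memory|oom-killer|killed process'
}

# Usage (not run here):
#   find_nagios_kills /var/log/messages
```

No output only means that none of these patterns matched. It does not rule out a crash that was logged elsewhere or not at all.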
tgriep
Madmin
 
Posts: 6019
Joined: Thu Oct 30, 2014 9:02 am

Re: Strange issue with Nagios stopping

Postby scottwilkerson » Mon Sep 11, 2017 4:40 pm

Did this behavior start after upgrading to 4.3.2?

When it stops is the service no longer running?

Re: Strange issue with Nagios stopping

Postby delboy1966 » Tue Sep 12, 2017 2:52 am

In answer to the questions posted:

I upgraded to Nagios 4.3.2 around June time and this has only been happening for the past month, so I think the upgrade is unrelated. However when I return to work on 25th I plan to upgrade to the latest version.

When Nagios dies the process is no longer running, thus no longer doing any checks.

Nothing is displayed in /var/log/messages or nagios.log and nagios.debug to indicate why the process stopped.
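
One thing that may explain the empty nagios.debug: the config posted at the top of the thread has debug_level=0 and daemon_dumps_core=0. As far as I know, a debug level of 0 writes nothing to the debug file, and with core dumps off the daemon leaves no core file behind. A small sketch to print the current values follows. The function name `show_debug_settings` is made up for illustration.

```shell
# Show the crash-diagnostic settings currently active in a nagios.cfg.
# Commented-out lines are ignored because the pattern is anchored at line start.
# $1: path to nagios.cfg (e.g. /usr/local/nagios/etc/nagios.cfg)
show_debug_settings() {
    grep -E '^(debug_level|debug_verbosity|debug_file|daemon_dumps_core)=' "$1"
}

# Usage (not run here):
#   show_debug_settings /usr/local/nagios/etc/nagios.cfg
```

Setting debug_level=-1 and daemon_dumps_core=1 before the next reload should leave more evidence behind, though on an install of this size the debug file will probably grow quickly.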

I'm not currently able to unload livestatus or stop gearmand as it's in a live environment and a lot depends on it running, especially maps from Nagvis.
However upon my return to work on 25th I will have a copy of the live Nagios box waiting for me to run in a test environment doing exactly the same checks as the live box. I can then unload livestatus and stop gearmand and do some testing.

Until the 25th I'm not able to do anything.
But will update this topic when I have some more info, so if it can be left open I would appreciate it.

Thanks all.

Re: Strange issue with Nagios stopping

Postby scottwilkerson » Tue Sep 12, 2017 7:56 am

Sounds good, we will leave it open.

Re: Strange issue with Nagios stopping

Postby danjoh » Tue Sep 12, 2017 9:01 am

Are you sure you do not have anything in the nagios.log?
Reason for asking is that we have a similar setup and we had issues where Nagios crashed and the one thing I was seeing in the log was "Caught SIGSEGV, shutting down...".
To work around this I had to "manually" re-compile mod_gearman with the latest Nagios headers.

Short howto:
1. Unpack the Nagios 4 sources.
2. Unpack the mod_gearman sources.
3. Replace the headers in mod_gearman/include/nagios4/ with their counterparts in nagios/include/.
4. Replace the headers in mod_gearman/include/nagios4/lib/ with their counterparts in nagios/lib/.
5. Run ./configure for mod_gearman with --disable-naemon-neb-module --disable-nagios3-neb-module (and any other parameters you need or want).
6. Then continue with the build/install of mod_gearman as described in INSTALL/README.
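
The header replacement steps above could be sketched roughly as follows. The directory names in the usage comments are assumptions about where the two source trees were unpacked, and `sync_headers` is a hypothetical helper, not something shipped with Nagios or mod_gearman. It only overwrites headers that already exist on the mod_gearman side, which is how "counterpart" is read here.

```shell
# Overwrite each header in the mod_gearman tree with the same-named header
# from the Nagios source tree. Headers with no counterpart are left alone.
# $1: Nagios header dir (source), $2: mod_gearman header dir (destination)
sync_headers() {
    for dst in "$2"/*.h; do
        [ -f "$dst" ] || continue
        src="$1/$(basename "$dst")"
        if [ -f "$src" ]; then
            cp "$src" "$dst"
        fi
    done
}

# Usage (paths are assumptions; adjust to where the sources were unpacked):
#   sync_headers nagios/include mod_gearman/include/nagios4
#   sync_headers nagios/lib     mod_gearman/include/nagios4/lib
#   cd mod_gearman && ./configure --disable-naemon-neb-module --disable-nagios3-neb-module
```

Keeping a backup copy of the original mod_gearman headers first makes it easy to see afterwards which files actually changed.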
--
D/\N
danjoh
 
Posts: 31
Joined: Mon Dec 07, 2015 10:43 am
Location: Zürich, Switzerland

Re: Strange issue with Nagios stopping

Postby scottwilkerson » Tue Sep 12, 2017 9:12 am


Thanks for sharing @danjoh!
