Strange issue with Nagios stopping

delboy1966
Posts: 94
Joined: Thu Oct 22, 2015 5:26 am

Strange issue with Nagios stopping

Post by delboy1966 »

Over the past few weeks we have experienced times when Nagios just stops running.
Can't tie it down to anything, nothing on the Nagios server has been updated or installed.
The only thing that we have done is add new hosts and service checks.

When it stops, it can take a few attempts to get it back up and running.
We start it, it runs for maybe a minute or so, and then it dies again.
There is nothing in the log file to indicate why it has stopped.

Has anyone ever seen this?
Is there any logging I can put in place to see why it does it?


Specs are:
Nagios 4.3.2 core
gearmand-0.25-1
Nagvis 1.8.5
Livestatus 1.2.7i3p2

Running on a VM, which it has been running on for over 2 years now.
64G RAM
6 x CPU

I have checked the config files using the -v switch with nagios and there are no errors and no warnings.

###
Running pre-flight check on configuration data...

Checking objects...
Checked 18167 services.
Checked 2554 hosts.
Checked 359 host groups.
Checked 63 service groups.
Checked 140 contacts.
Checked 40 contact groups.
Checked 345 commands.
Checked 33 time periods.
Checked 0 host escalations.
Checked 0 service escalations.
Checking for circular paths...
Checked 2554 hosts
Checked 0 service dependencies
Checked 0 host dependencies
Checked 33 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors: 0
###


Nagios config file:
##########
log_file=/usr/local/nagios/var/nagios.log
cfg_dir=/usr/local/nagios/etc/objects
object_cache_file=/usr/local/nagios/var/objects.cache
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
status_file=/usr/local/nagios/var/ramdisk/status.dat
status_update_interval=30
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_file=/usr/local/nagios/var/rw/nagios.cmd
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/usr/local/nagios/var/ramdisk/tmp
event_broker_options=-1
log_rotation_method=d
log_archive_path=/usr/local/nagios/var/archives/naglogs
use_syslog=0
log_notifications=1
log_service_retries=0
log_host_retries=0
log_event_handlers=0
log_initial_states=0
log_current_states=0
log_external_commands=0
log_passive_checks=0
global_host_event_handler=log_host_state_changes
service_inter_check_delay_method=s
max_service_check_spread=40
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=40
max_concurrent_checks=0
check_result_reaper_frequency=10
max_check_result_reaper_time=30
check_result_path=/usr/local/nagios/var/ramdisk/spool/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=15
cached_service_check_horizon=15
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
soft_state_dependencies=0
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
service_check_timeout=90
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=0
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
interval_length=60
check_for_updates=0
bare_update_check=0
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=1
obsess_over_services=1
ocsp_command=global_service_event
obsess_over_hosts=0
translate_passive_host_checks=0
passive_host_checks_are_soft=0
check_for_orphaned_services=1
check_for_orphaned_hosts=1
check_service_freshness=0
service_freshness_check_interval=60
service_check_timeout_state=u
check_host_freshness=0
host_freshness_check_interval=60
additional_freshness_latency=15
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=euro
illegal_object_name_chars=`~!$%^*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios@localhost
admin_pager=pagenagios@localhost
daemon_dumps_core=0
use_large_installation_tweaks=1
enable_environment_macros=1
debug_level=0
debug_verbosity=0
debug_file=/usr/local/nagios/var/nagios.debug
max_debug_file_size=1000000
allow_empty_hostgroup_assignment=0
host_down_disable_service_checks=1
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live log_file=/usr/local/nagios/var/archives/naglogs/livestatus.log
broker_module=/usr/lib64/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf

##############
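On the logging question: the config above has debug_level=0 and daemon_dumps_core=0, so nagios.debug stays empty and the daemon is not allowed to dump core. A minimal sketch of switching these on in a test copy of the config follows. The helper name set_cfg and the temporary file are made up for illustration; debug_level=-1 ("log everything") and debug_verbosity=2 are standard nagios.cfg values, but whether a core file is actually written also depends on the system's core limits, which is not covered here.

```shell
#!/bin/sh
# Set KEY=VALUE in a nagios.cfg-style file: replace the line if the key
# is already present, otherwise append it. (Hypothetical helper.)
set_cfg() {
    key=$1
    value=$2
    file=$3
    if grep -q "^${key}=" "$file"; then
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demonstration on a scratch file, not on the live nagios.cfg.
cfg=$(mktemp)
printf 'debug_level=0\ndebug_verbosity=0\ndaemon_dumps_core=0\n' > "$cfg"

set_cfg debug_level -1 "$cfg"       # -1 = log everything
set_cfg debug_verbosity 2 "$cfg"    # most detailed output
set_cfg daemon_dumps_core 1 "$cfg"  # let the daemon dump core on a crash

cat "$cfg"
rm -f "$cfg"
```

With 18167 services the debug file will grow quickly, so this is probably only worth leaving on long enough to catch one crash.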


Thanks in advance.

Tony
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Strange issue with Nagios stopping

Post by scottwilkerson »

Was mod_gearman compiled using the nagios4 flag?

Is it possible to temporarily disable the mod_gearman and livestatus addons to make sure they aren't causing the service to stop?

Finally, can we verify that there is not more than one parent process?


ps -ef | grep bin/nagios
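The same check can be done without reading the listing by eye. This is only a sketch: count_nagios_parents is a made-up helper that counts processes whose command ends in bin/nagios and whose parent PID is 1, which is assumed here to be what "parent process" means. The two sample lines are a reduced pid/ppid/args form of a normal single-daemon listing.

```shell
#!/bin/sh
# Count Nagios processes whose parent is PID 1.
# Reads "pid ppid args..." lines on stdin (the format of `ps -eo pid,ppid,args`).
count_nagios_parents() {
    awk '$2 == 1 && $3 ~ /bin\/nagios$/ { n++ } END { print n + 0 }'
}

# Sample input: one daemon (PPID 1) and one child that also shows "-d".
count_nagios_parents <<'EOF'
7072 1 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
7083 7072 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
EOF
# prints 1

# Against the live system:
# ps -eo pid,ppid,args | count_nagios_parents
```

A result above 1 would suggest that a second daemon was started alongside the first.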
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
delboy1966
Posts: 94
Joined: Thu Oct 22, 2015 5:26 am

Re: Strange issue with Nagios stopping

Post by delboy1966 »

Hi

Thanks for the reply.

I installed mod_gearman from RPM mod_gearman-1.4_nagios4-1.el6.x86_64

Currently if I disable livestatus all our maps will be affected which isn't ideal. However I am in the process of getting a duplicate Nagios box provisioned which will be identical to the current live one but on different IP addresses. So I will be able to play about with that, but that will now be in 2 weeks when I return to work after a break.

The output of the ps command is:

nagios 7072 1 18 Sep09 ? 07:26:25 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 7074 7072 0 Sep09 ? 00:02:45 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7075 7072 0 Sep09 ? 00:02:44 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7076 7072 0 Sep09 ? 00:02:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7077 7072 0 Sep09 ? 00:02:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7078 7072 0 Sep09 ? 00:02:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7079 7072 0 Sep09 ? 00:02:44 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7080 7072 0 Sep09 ? 00:02:44 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7081 7072 0 Sep09 ? 00:02:45 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7082 7072 0 Sep09 ? 00:02:45 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 7083 7072 0 Sep09 ? 00:00:08 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg


Tony
delboy1966
Posts: 94
Joined: Thu Oct 22, 2015 5:26 am

Re: Strange issue with Nagios stopping

Post by delboy1966 »

As an update.

I am told that this seems to happen when Nagios has been reloaded after changes have been made.
Nagios is reloaded and sometimes runs for a couple of minutes before it stops, and sometimes quits after about 10 seconds.
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Strange issue with Nagios stopping

Post by tgriep »

Do you see any errors in the messages log file?
Take a look at this file to see if there are any errors:


/var/log/messages
Try disabling livestatus or gearman to see if it runs longer. If it does run longer, you know where to look next.
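A rough way to scan that file for the usual signs of a process being killed or crashing is sketched below. The pattern list (kernel segfault and out-of-memory messages) is an assumption about what might be relevant, not a confirmed cause, and scan_crash_hints is a made-up helper name.

```shell
#!/bin/sh
# Print lines from a syslog-style file that mention a segfault or the
# kernel out-of-memory killer. The pattern list is a guess at what matters.
scan_crash_hints() {
    grep -iE 'segfault|out of memory|oom-killer|killed process' "$1"
}

LOG=/var/log/messages
if [ -r "$LOG" ]; then
    # Show only the most recent matches.
    scan_crash_hints "$LOG" | tail -n 20
else
    echo "cannot read $LOG (try as root)" >&2
fi
```

No matches does not rule out a crash; it only means the kernel did not log one in that file.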
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Strange issue with Nagios stopping

Post by scottwilkerson »

Did this behavior start after upgrading to 4.3.2?

When it stops is the service no longer running?
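One way to answer the second question from a script is to test the PID stored in the lock file (lock_file in nagios.cfg, /usr/local/nagios/var/nagios.lock in the config posted earlier). This is a sketch only: nagios_running is a made-up helper, and kill -0 only reports reliably when run as root or as the nagios user.

```shell
#!/bin/sh
# Return 0 if the PID stored in the given lock file belongs to a live process.
# kill -0 sends no signal; it only checks that the process can be signalled,
# so run this as root or as the user that owns the daemon.
nagios_running() {
    lock=$1
    [ -r "$lock" ] || return 1
    pid=$(cat "$lock")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

LOCK=/usr/local/nagios/var/nagios.lock
if nagios_running "$LOCK"; then
    echo "$(date): nagios is running"
else
    echo "$(date): nagios is NOT running"
fi
```

Run from cron every minute and appended to a file, this would at least narrow down the time of death to compare against reloads.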
delboy1966
Posts: 94
Joined: Thu Oct 22, 2015 5:26 am

Re: Strange issue with Nagios stopping

Post by delboy1966 »

In answer to the questions posted:

I upgraded to Nagios 4.3.2 around June time and this has only been happening for the past month, so I think the upgrade is unrelated. However when I return to work on 25th I plan to upgrade to the latest version.

When Nagios dies the process is no longer running, thus no longer doing any checks.

Nothing is displayed in /var/log/messages, nagios.log, or nagios.debug to indicate why the process stopped.

I'm not currently able to unload livestatus or stop gearmand as it's in a live environment and a lot depends on it running, especially maps from Nagvis.
However upon my return to work on 25th I will have a copy of the live Nagios box waiting for me to run in a test environment doing exactly the same checks as the live box. I can then unload livestatus and stop gearmand and do some testing.

Until the 25th I'm not able to do anything.
But will update this topic when I have some more info, so if it can be left open I would appreciate it.

Thanks all.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Strange issue with Nagios stopping

Post by scottwilkerson »

Sounds good, we will leave it open.
danjoh
Posts: 73
Joined: Mon Dec 07, 2015 10:43 am
Location: Zürich, Switzerland

Re: Strange issue with Nagios stopping

Post by danjoh »

Are you sure you do not have anything in the nagios.log?
The reason for asking is that we have a similar setup and we had issues where Nagios crashed, and the one thing I was seeing in the log was "Caught SIGSEGV, shutting down...".
To work around this I had to "manually" re-compile mod_gearman with the latest Nagios headers.
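To see quickly whether any such line exists, the "Caught SIG..." messages in the current and archived logs can be counted per signal. A small sketch, assuming the log paths from the nagios.cfg posted earlier in the thread and a grep that supports -o (GNU and BSD grep do); summarize_signals is a made-up name.

```shell
#!/bin/sh
# Count "Caught SIG..." messages per signal name across the given log files.
summarize_signals() {
    grep -ho 'Caught SIG[A-Z]*' "$@" | sort | uniq -c
}

# Paths taken from log_file and log_archive_path in the posted config.
summarize_signals /usr/local/nagios/var/nagios.log \
    /usr/local/nagios/var/archives/naglogs/*.log 2>/dev/null
```

SIGTERM and SIGHUP entries are expected from normal stops and reloads; a SIGSEGV entry would point at a crash inside the daemon or one of its broker modules.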

Short howto:
Unpack Nagios-4 sources
Unpack mod_gearman sources
Replace the headers in mod_gearman/include/nagios4/ with their counterparts in nagios/includes/
Replace the headers in mod_gearman/include/nagios4/lib/ with their counterparts in nagios/lib/
Run ./configure with --disable-naemon-neb-module --disable-nagios3-neb-module (and any other parameters you need/want) for mod_gearman
Then continue with the build/install of mod_gearman as described in INSTALL/README.
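The two "replace the headers" steps could be scripted roughly as follows. This is a sketch, not a tested build recipe: sync_headers is a made-up helper, and NAGIOS_HDR_DIR, NAGIOS_LIB_DIR and MODGEARMAN_SRC are hypothetical variables standing for the unpacked source directories. It reads "counterpart" as the file with the same name, and leaves any header that has none in the Nagios tree untouched.

```shell
#!/bin/sh
# Overwrite each header in DST with the same-named header from SRC.
# Headers that exist only in DST are kept and reported on stderr;
# headers that exist only in SRC are not copied.
sync_headers() {
    src=$1
    dst=$2
    for h in "$dst"/*.h; do
        [ -e "$h" ] || continue          # glob matched nothing
        name=$(basename "$h")
        if [ -f "$src/$name" ]; then
            cp "$src/$name" "$h"
        else
            echo "no counterpart for $name in $src" >&2
        fi
    done
}

# Only runs when all three (hypothetical) variables are set.
if [ -n "${NAGIOS_HDR_DIR:-}" ] && [ -n "${NAGIOS_LIB_DIR:-}" ] &&
   [ -n "${MODGEARMAN_SRC:-}" ]; then
    sync_headers "$NAGIOS_HDR_DIR" "$MODGEARMAN_SRC/include/nagios4"
    sync_headers "$NAGIOS_LIB_DIR" "$MODGEARMAN_SRC/include/nagios4/lib"
    # Then, inside $MODGEARMAN_SRC:
    #   ./configure --disable-naemon-neb-module --disable-nagios3-neb-module
    # followed by the build/install steps from INSTALL/README.
fi
```

Keeping a backup of the original mod_gearman header directory makes it easy to diff what actually changed between the bundled headers and the installed Nagios version.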
--
D/\N
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Strange issue with Nagios stopping

Post by scottwilkerson »

Thanks for sharing @danjoh!