Page 1 of 2

repeated timed out error in the nagios log file

Posted: Wed Feb 11, 2015 8:02 pm
by ducle99
There're dozens of repeated messages in nagios log file as follows:

Code: Select all

[1423698292] wproc: Core Worker 2838: job 11 (pid=3084) timed out. Killing it
[1423698292] wproc: Core Worker 2838: job 11 with pid 3084 reaped at timeout. timeouts=11; started=12
[1423698293] wproc: Core Worker 2839: job 11 (pid=3086) timed out. Killing it
[1423698293] wproc: Core Worker 2839: job 11 with pid 3086 reaped at timeout. timeouts=11; started=12
I don't know what the issue with nagios configuration. Could you please provide some help?

Re: repeated timed out error in the nagios log file

Posted: Thu Feb 12, 2015 2:10 pm
by tgriep
We need to see the nagios.log from the point where nagios is restarted. Run the following on one terminal:

Code: Select all

tail -f /usr/local/nagios/var/nagios.log > /tmp/nagios_log.txt

Then, on another terminal, restart nagios:

Code: Select all

service nagios restart
Wait a few minutes, kill the tail with ctrl-c, and then post the contents of /tmp/nagios_log.txt

What version of Nagios Core are you running?

Re: repeated timed out error in the nagios log file

Posted: Tue Oct 20, 2015 6:45 am
by iscb2015
I've just implemented nagios 4.1.1 and am having a similar issue .. extracts from nagios log below ...

Code: Select all

[1445329430] Nagios 4.1.1 starting... (PID=29152)
[1445329430] Local time is Tue Oct 20 09:23:50 BST 2015
[1445329430] LOG VERSION: 2.0
[1445329430] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1445329430] qh: core query handler registered
[1445329430] nerd: Channel hostchecks registered successfully
[1445329430] nerd: Channel servicechecks registered successfully
[1445329430] nerd: Channel opathchecks registered successfully
[1445329430] nerd: Fully initialized and ready to rock!
[1445329430] wproc: Successfully registered manager as @wproc with query handler
[1445329430] wproc: Registry request: name=Core Worker 29153;pid=29153
[1445329430] wproc: Registry request: name=Core Worker 29154;pid=29154
[1445329430] wproc: Registry request: name=Core Worker 29155;pid=29155
[1445329430] wproc: Registry request: name=Core Worker 29156;pid=29156
[1445329430] wproc: Registry request: name=Core Worker 29157;pid=29157
[1445329430] wproc: Registry request: name=Core Worker 29195;pid=29195
[1445329430] wproc: Registry request: name=Core Worker 29196;pid=29196
[1445329430] wproc: Registry request: name=Core Worker 29197;pid=29197
[1445329430] wproc: Registry request: name=Core Worker 29200;pid=29200
[1445329432] WARNING: Extinfo objects are deprecated and will be removed in future versions
[1445329432] WARNING: Extinfo objects are deprecated and will be removed in future versions
[1445329432] WARNING: Extinfo objects are deprecated and will be removed in future versions
==== repeated Extinfo messages removed

Code: Select all

[1445329434] Successfully launched command file worker with pid 29202
[1445329469] wproc: Core Worker 29153: job 1 (pid=29220) timed out. Killing it
[1445329469] wproc: Core Worker 29153: job 1 with pid 29220 reaped at timeout. timeouts=1; started=8
[1445329470] wproc: Core Worker 29155: job 1 (pid=29223) timed out. Killing it
[1445329470] wproc: Core Worker 29155: job 1 with pid 29223 reaped at timeout. timeouts=1; started=8
[1445329471] wproc: Core Worker 29195: job 1 (pid=29228) timed out. Killing it
[1445329471] wproc: Core Worker 29195: job 1 with pid 29228 reaped at timeout. timeouts=1; started=8
[1445329472] wproc: Core Worker 29197: job 1 (pid=29231) timed out. Killing it
[1445329472] wproc: Core Worker 29197: job 1 with pid 29231 reaped at timeout. timeouts=1; started=8
[1445329473] wproc: Core Worker 29200: job 1 (pid=29233) timed out. Killing it
[1445329473] wproc: Core Worker 29200: job 1 with pid 29233 reaped at timeout. timeouts=1; started=8
[1445329474] wproc: Core Worker 29153: job 2 (pid=29236) timed out. Killing it
[1445329474] wproc: Core Worker 29153: job 2 with pid 29236 reaped at timeout. timeouts=2; started=9
[1445329475] wproc: Core Worker 29154: job 2 (pid=29238) timed out. Killing it
==== many thousands of lines similar to the above ....

There are also a large number of <defunct> processes on the server.

Any suggestions ?

Ian

----------

In trying to diagnose what is generating the timeout messages I have been looking at the status.dat file - and I have noticed that the execution time for the checks is (generally) just over 60 seconds for services and just over 30 seconds for hosts. Both of these seem to link to the timeout values in the nagios configuration.

Code: Select all

$ egrep "check_execution_time=60|check_execution_time=30" status.dat | wc -l
    1662
$ egrep "check_execution_time" status.dat | wc -l
    1703
Interestingly the web interface is reporting the correct information - I'm just worried that there is something else going on which will break at some point in the future.

If I run a check_ping manually it usually responds within a couple of seconds so I'm confused about why there are all these timeout / killed messages .....

Ian

----------

Does this relate to the issue described at https://tracker.nagios.org/view.php?id=498 Later in the thread it has the same messages as I am seeing and reports of a large number of defunct processes ....

Any help would be appreciated ...

Ian

----------

The apparent solution documented in the bug tracker (of configuring iobroker to epoll) is not available to me ...

./configure[6515]: test: 1: unknown operator
"epoll" is not available as an iobroker method.
Please use one of the other options.

I should have said that I'm running nagios on Solaris 11.2 (SPARC).

Ian

Re: repeated timed out error in the nagios log file

Posted: Tue Oct 20, 2015 4:28 pm
by Box293
Can you please post your nagios.cfg file.

Re: repeated timed out error in the nagios log file

Posted: Wed Oct 21, 2015 2:21 am
by iscb2015
nagios.cfg as requested (minus all the comments for brevity)
Ian

Code: Select all

# grep -v "^#" nagios.cfg | grep -v "^$"
log_file=/usr/local/nagios/var/nagios.log
cfg_dir=/usr/local/nagios/etc/MCL_config
object_cache_file=/usr/local/nagios/var/objects.cache
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
status_file=/usr/local/nagios/var/status.dat
status_update_interval=10
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_file=/usr/local/nagios/var/rw/nagios.cmd
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/tmp
event_broker_options=-1
log_rotation_method=d
log_archive_path=/usr/local/nagios/var/archives
use_syslog=1
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_current_states=1
log_external_commands=1
log_passive_checks=1
service_inter_check_delay_method=s
max_service_check_spread=30
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=30
max_concurrent_checks=0
check_result_reaper_frequency=10
max_check_result_reaper_time=30
check_result_path=/usr/local/nagios/var/spool/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=15
cached_service_check_horizon=15
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
soft_state_dependencies=0
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
interval_length=60
check_for_updates=1
bare_update_check=0
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=0
obsess_over_services=0
obsess_over_hosts=0
translate_passive_host_checks=0
passive_host_checks_are_soft=0
check_for_orphaned_services=1
check_for_orphaned_hosts=1
check_service_freshness=1
service_freshness_check_interval=60
service_check_timeout_state=c
check_host_freshness=0
host_freshness_check_interval=60
additional_freshness_latency=15
enable_flap_detection=1
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=euro
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios@localhost
admin_pager=pagenagios@localhost
daemon_dumps_core=0
use_large_installation_tweaks=0
enable_environment_macros=0
child_processes_fork_twice=0
debug_level=0
debug_verbosity=1
debug_file=/usr/local/nagios/var/nagios.debug
max_debug_file_size=1000000
allow_empty_hostgroup_assignment=0
#

Re: repeated timed out error in the nagios log file

Posted: Wed Oct 21, 2015 5:07 pm
by jolson

Code: Select all

[1445329434] Successfully launched command file worker with pid 29202
[1445329469] wproc: Core Worker 29153: job 1 (pid=29220) timed out. Killing it
Is there any chance that multiple Nagios processes are running? A Nagios process is incapable of reaping children that aren't its own.

Check for multiple processes:

Code: Select all

ps -ef | grep /bin/nag

Re: repeated timed out error in the nagios log file

Posted: Thu Oct 22, 2015 1:27 am
by iscb2015
Unfortunately not ... There are multiple worker processes - only one nagios (and child) process though.

Ian

Code: Select all

# ps -ef | grep /bin/nag
  nagios 14222 14178   3 15:08:42 ?         976:42 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14224 14178   3 15:08:42 ?         976:49 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14183 14178   3 15:08:42 ?         976:47 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
    root  4137  3697   0 07:25:45 pts/1       0:00 grep /bin/nag
  nagios 14226 14178   3 15:08:42 ?         976:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14179 14178   3 15:08:42 ?         976:52 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14178     1   0 15:08:42 ?          22:40 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
  nagios 14181 14178   3 15:08:42 ?         976:51 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14221 14178   3 15:08:42 ?         976:50 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14180 14178   3 15:08:42 ?         976:48 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14182 14178   3 15:08:42 ?         976:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14227 14178   0 15:08:45 ?           0:06 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
#
----------

I tried building nagios with the configure option --with-iobroker=select which did remove the following type of messages from the log

Code: Select all

[1445496024] wproc: Core Worker 14222: job 36693 with pid 13630 reaped at timeout. timeouts=36696; started=36727
[1445496024] wproc: Core Worker 14226: job 36705 (pid=13850) timed out. Killing it
Unfortunately though no checks were being performed and the following messages were seen in the log instead (looks like a similar message for each check attempted)

Code: Select all

[1445496081] Successfully launched command file worker with pid 14443
[1445496344] wproc: 'Core Worker 14398' seems to be choked. ret = -1; bufsize = 107: errno = 11 (Resource temporarily unavailable)
[1445496344] Unable to run check for service 'Disk space var' on host 'HOSTNAME1'
[1445496345] wproc: 'Core Worker 14398' seems to be choked. ret = -1; bufsize = 107: errno = 11 (Resource temporarily unavailable)
[1445496345] Unable to run check for service 'Disk space var' on host 'HOSTNAME2'
Does anyone have any suggestions as to the potential issue here ?

Ian

----------

I have just tried to build the software with the configure option --disable-event-broker - however the build fails with undefined symbols (eg neb_free_module_list)

Ian

Re: repeated timed out error in the nagios log file

Posted: Thu Oct 22, 2015 4:58 pm
by tmcdonald
Can you send the specific lines from the build when it fails? 20 or so lines should be good.

Re: repeated timed out error in the nagios log file

Posted: Fri Oct 23, 2015 1:30 am
by iscb2015
See below for the requested lines. I'm not even sure if this is the right direction for finding a resolution for the bug .. was there anything in the nagios.cfg that could cause the issue ?

Cheers

Ian

gmake -C ../lib
gmake[2]: Entering directory `/var/tmp/ian/nagios-4.1.1/lib'
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c squeue.c -o squeue.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c kvvec.c -o kvvec.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c iocache.c -o iocache.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c iobroker.c -o iobroker.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c bitmap.c -o bitmap.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c dkhash.c -o dkhash.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c runcmd.c -o runcmd.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c nsutils.c -o nsutils.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c fanout.c -o fanout.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c pqueue.c -o pqueue.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c worker.c -o worker.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c skiplist.c -o skiplist.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c nsock.c -o nsock.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c nspath.c -o nspath.o
ar cr libnagios.a squeue.o kvvec.o iocache.o iobroker.o bitmap.o dkhash.o runcmd.o nsutils.o fanout.o pqueue.o worker.o skiplist.o nsock.o nspath.o
gmake[2]: Leaving directory `/var/tmp/ian/nagios-4.1.1/lib'
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -o nagios nagios.o ../common/shared.o nerd.o query-handler.o workers.o checks.o config.o commands.o events.o flapping.o logging.o macros-base.o netutils.o notifications.o sehandlers.o utils.o retention-base.o xretention-base.o comments-base.o xcomments-base.o objects-base.o xobjects-base.o statusdata-base.o xstatusdata-base.o perfdata-base.o xperfdata-base.o downtime-base.o -lm -lsocket -lsocket ../lib/libnagios.a
Undefined first referenced
symbol in file
neb_free_module_list nagios.o
neb_register_callback nerd.o
neb_deregister_callback nerd.o
neb_add_core_module nerd.o
ld: fatal: symbol referencing errors
collect2: error: ld returned 1 exit status
gmake[1]: *** [nagios] Error 1
gmake[1]: Leaving directory `/var/tmp/ian/nagios-4.1.1/base'
gmake: *** [all] Error 2

Re: repeated timed out error in the nagios log file

Posted: Fri Oct 23, 2015 1:58 am
by iscb2015
Hi

I have raised a report at https://github.com/NagiosEnterprises/nagioscore/issues (number 84) for this issue - I presume that this will help in having somewhere to log time.

Cheers

Ian