repeated timed out error in the nagios log file

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
ducle99
Posts: 11
Joined: Tue Jan 20, 2015 7:26 pm

repeated timed out error in the nagios log file

Post by ducle99 »

There're dozens of repeated messages in nagios log file as follows:

Code: Select all

[1423698292] wproc: Core Worker 2838: job 11 (pid=3084) timed out. Killing it
[1423698292] wproc: Core Worker 2838: job 11 with pid 3084 reaped at timeout. timeouts=11; started=12
[1423698293] wproc: Core Worker 2839: job 11 (pid=3086) timed out. Killing it
[1423698293] wproc: Core Worker 2839: job 11 with pid 3086 reaped at timeout. timeouts=11; started=12
I don't know what the issue with nagios configuration. Could you please provide some help?
User avatar
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: repeated timed out error in the nagios log file

Post by tgriep »

We need to see the nagios.log from the point where nagios is restarted. Run the following on one terminal:

Code: Select all

tail -f /usr/local/nagios/var/nagios.log > /tmp/nagios_log.txt

Then, on another terminal, restart nagios:

Code: Select all

service nagios restart
Wait a few minutes, kill the tail with ctrl-c, and then post the contents of /tmp/nagios_log.txt

What version of Nagios Core are you running?
Be sure to check out our Knowledgebase for helpful articles and solutions!
iscb2015
Posts: 13
Joined: Fri Aug 28, 2015 2:35 am

Re: repeated timed out error in the nagios log file

Post by iscb2015 »

I've just implemented nagios 4.1.1 and am having a similar issue .. extracts from nagios log below ...

Code: Select all

[1445329430] Nagios 4.1.1 starting... (PID=29152)
[1445329430] Local time is Tue Oct 20 09:23:50 BST 2015
[1445329430] LOG VERSION: 2.0
[1445329430] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1445329430] qh: core query handler registered
[1445329430] nerd: Channel hostchecks registered successfully
[1445329430] nerd: Channel servicechecks registered successfully
[1445329430] nerd: Channel opathchecks registered successfully
[1445329430] nerd: Fully initialized and ready to rock!
[1445329430] wproc: Successfully registered manager as @wproc with query handler
[1445329430] wproc: Registry request: name=Core Worker 29153;pid=29153
[1445329430] wproc: Registry request: name=Core Worker 29154;pid=29154
[1445329430] wproc: Registry request: name=Core Worker 29155;pid=29155
[1445329430] wproc: Registry request: name=Core Worker 29156;pid=29156
[1445329430] wproc: Registry request: name=Core Worker 29157;pid=29157
[1445329430] wproc: Registry request: name=Core Worker 29195;pid=29195
[1445329430] wproc: Registry request: name=Core Worker 29196;pid=29196
[1445329430] wproc: Registry request: name=Core Worker 29197;pid=29197
[1445329430] wproc: Registry request: name=Core Worker 29200;pid=29200
[1445329432] WARNING: Extinfo objects are deprecated and will be removed in future versions
[1445329432] WARNING: Extinfo objects are deprecated and will be removed in future versions
[1445329432] WARNING: Extinfo objects are deprecated and will be removed in future versions
==== repeated Extinfo messages removed

Code: Select all

[1445329434] Successfully launched command file worker with pid 29202
[1445329469] wproc: Core Worker 29153: job 1 (pid=29220) timed out. Killing it
[1445329469] wproc: Core Worker 29153: job 1 with pid 29220 reaped at timeout. timeouts=1; started=8
[1445329470] wproc: Core Worker 29155: job 1 (pid=29223) timed out. Killing it
[1445329470] wproc: Core Worker 29155: job 1 with pid 29223 reaped at timeout. timeouts=1; started=8
[1445329471] wproc: Core Worker 29195: job 1 (pid=29228) timed out. Killing it
[1445329471] wproc: Core Worker 29195: job 1 with pid 29228 reaped at timeout. timeouts=1; started=8
[1445329472] wproc: Core Worker 29197: job 1 (pid=29231) timed out. Killing it
[1445329472] wproc: Core Worker 29197: job 1 with pid 29231 reaped at timeout. timeouts=1; started=8
[1445329473] wproc: Core Worker 29200: job 1 (pid=29233) timed out. Killing it
[1445329473] wproc: Core Worker 29200: job 1 with pid 29233 reaped at timeout. timeouts=1; started=8
[1445329474] wproc: Core Worker 29153: job 2 (pid=29236) timed out. Killing it
[1445329474] wproc: Core Worker 29153: job 2 with pid 29236 reaped at timeout. timeouts=2; started=9
[1445329475] wproc: Core Worker 29154: job 2 (pid=29238) timed out. Killing it
==== many thousands of lines similar to the above ....

There are also a large number of <defunct> processes on the server.

Any suggestions ?

Ian

----------

In trying to diagnose what is generating the timeout messages I have been looking at the status.dat file - and I have noticed that the execution time for the checks is (generally) just over 60 seconds for services and just over 30 seconds for hosts. Both of these seem to link to the timeout values in the nagios configuration.

Code: Select all

$ egrep "check_execution_time=60|check_execution_time=30" status.dat | wc -l
    1662
$ egrep "check_execution_time" status.dat | wc -l
    1703
Interestingly the web interface is reporting the correct information - I'm just worried that there is something else going on which will break at some point in the future.

If I run a check_ping manually it usually responds within a couple of seconds so I'm confused about why there are all these timeout / killed messages .....

Ian

----------

Does this relate to the issue described at https://tracker.nagios.org/view.php?id=498 Later in the thread it has the same messages as I am seeing and reports of a large number of defunct processes ....

Any help would be appreciated ...

Ian

----------

The apparent solution documented in the bug tracker (of configuring iobroker to epoll) is not available to me ...

./configure[6515]: test: 1: unknown operator
"epoll" is not available as an iobroker method.
Please use one of the other options.

I should have said that I'm running nagios on Solaris 11.2 (SPARC).

Ian
Last edited by tmcdonald on Thu Oct 22, 2015 3:13 pm, edited 2 times in total.
Reason: Please edit your last post instead of posting multiple times in a row. I have combined your 4 posts into 1.
User avatar
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia
Contact:

Re: repeated timed out error in the nagios log file

Post by Box293 »

Can you please post your nagios.cfg file.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
iscb2015
Posts: 13
Joined: Fri Aug 28, 2015 2:35 am

Re: repeated timed out error in the nagios log file

Post by iscb2015 »

nagios.cfg as requested (minus all the comments for brevity)
Ian

Code: Select all

# grep -v "^#" nagios.cfg | grep -v "^$"
log_file=/usr/local/nagios/var/nagios.log
cfg_dir=/usr/local/nagios/etc/MCL_config
object_cache_file=/usr/local/nagios/var/objects.cache
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
status_file=/usr/local/nagios/var/status.dat
status_update_interval=10
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_file=/usr/local/nagios/var/rw/nagios.cmd
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/tmp
event_broker_options=-1
log_rotation_method=d
log_archive_path=/usr/local/nagios/var/archives
use_syslog=1
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_current_states=1
log_external_commands=1
log_passive_checks=1
service_inter_check_delay_method=s
max_service_check_spread=30
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=30
max_concurrent_checks=0
check_result_reaper_frequency=10
max_check_result_reaper_time=30
check_result_path=/usr/local/nagios/var/spool/checkresults
max_check_result_file_age=3600
cached_host_check_horizon=15
cached_service_check_horizon=15
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
soft_state_dependencies=0
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
interval_length=60
check_for_updates=1
bare_update_check=0
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=0
obsess_over_services=0
obsess_over_hosts=0
translate_passive_host_checks=0
passive_host_checks_are_soft=0
check_for_orphaned_services=1
check_for_orphaned_hosts=1
check_service_freshness=1
service_freshness_check_interval=60
service_check_timeout_state=c
check_host_freshness=0
host_freshness_check_interval=60
additional_freshness_latency=15
enable_flap_detection=1
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=euro
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios@localhost
admin_pager=pagenagios@localhost
daemon_dumps_core=0
use_large_installation_tweaks=0
enable_environment_macros=0
child_processes_fork_twice=0
debug_level=0
debug_verbosity=1
debug_file=/usr/local/nagios/var/nagios.debug
max_debug_file_size=1000000
allow_empty_hostgroup_assignment=0
#
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: repeated timed out error in the nagios log file

Post by jolson »

Code: Select all

[1445329434] Successfully launched command file worker with pid 29202
[1445329469] wproc: Core Worker 29153: job 1 (pid=29220) timed out. Killing it
Is there any chance that multiple Nagios processes are running? A Nagios process is incapable of reaping children that aren't its own.

Check for multiple processes:

Code: Select all

ps -ef | grep /bin/nag
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
iscb2015
Posts: 13
Joined: Fri Aug 28, 2015 2:35 am

Re: repeated timed out error in the nagios log file

Post by iscb2015 »

Unfortunately not ... There are multiple worker processes - only one nagios (and child) process though.

Ian

Code: Select all

# ps -ef | grep /bin/nag
  nagios 14222 14178   3 15:08:42 ?         976:42 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14224 14178   3 15:08:42 ?         976:49 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14183 14178   3 15:08:42 ?         976:47 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
    root  4137  3697   0 07:25:45 pts/1       0:00 grep /bin/nag
  nagios 14226 14178   3 15:08:42 ?         976:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14179 14178   3 15:08:42 ?         976:52 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14178     1   0 15:08:42 ?          22:40 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
  nagios 14181 14178   3 15:08:42 ?         976:51 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14221 14178   3 15:08:42 ?         976:50 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14180 14178   3 15:08:42 ?         976:48 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14182 14178   3 15:08:42 ?         976:46 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
  nagios 14227 14178   0 15:08:45 ?           0:06 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
#
----------

I tried building nagios with the configure option --with-iobroker=select which did remove the following type of messages from the log

Code: Select all

[1445496024] wproc: Core Worker 14222: job 36693 with pid 13630 reaped at timeout. timeouts=36696; started=36727
[1445496024] wproc: Core Worker 14226: job 36705 (pid=13850) timed out. Killing it
Unfortunately though no checks were being performed and the following messages were seen in the log instead (looks like a similar message for each check attempted)

Code: Select all

[1445496081] Successfully launched command file worker with pid 14443
[1445496344] wproc: 'Core Worker 14398' seems to be choked. ret = -1; bufsize = 107: errno = 11 (Resource temporarily unavailable)
[1445496344] Unable to run check for service 'Disk space var' on host 'HOSTNAME1'
[1445496345] wproc: 'Core Worker 14398' seems to be choked. ret = -1; bufsize = 107: errno = 11 (Resource temporarily unavailable)
[1445496345] Unable to run check for service 'Disk space var' on host 'HOSTNAME2'
Does anyone have any suggestions as to the potential issue here ?

Ian

----------

I have just tried to build the software with the configure option --disable-event-broker - however the build fails with undefined symbols (eg neb_free_module_list)

Ian
Last edited by tmcdonald on Thu Oct 22, 2015 3:16 pm, edited 1 time in total.
Reason: Please edit your previous post instead of posting multiple times in a row. I have combined your 3 posts into 1.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: repeated timed out error in the nagios log file

Post by tmcdonald »

Can you send the specific lines from the build when it fails? 20 or so lines should be good.
Former Nagios employee
iscb2015
Posts: 13
Joined: Fri Aug 28, 2015 2:35 am

Re: repeated timed out error in the nagios log file

Post by iscb2015 »

See below for the requested lines. I'm not even sure if this is the right direction for finding a resolution for the bug .. was there anything in the nagios.cfg that could cause the issue ?

Cheers

Ian

gmake -C ../lib
gmake[2]: Entering directory `/var/tmp/ian/nagios-4.1.1/lib'
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c squeue.c -o squeue.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c kvvec.c -o kvvec.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c iocache.c -o iocache.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c iobroker.c -o iobroker.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c bitmap.c -o bitmap.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c dkhash.c -o dkhash.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c runcmd.c -o runcmd.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c nsutils.c -o nsutils.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c fanout.c -o fanout.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c pqueue.c -o pqueue.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c worker.c -o worker.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c skiplist.c -o skiplist.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c nsock.c -o nsock.o
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -DHAVE_CONFIG_H -c nspath.c -o nspath.o
ar cr libnagios.a squeue.o kvvec.o iocache.o iobroker.o bitmap.o dkhash.o runcmd.o nsutils.o fanout.o pqueue.o worker.o skiplist.o nsock.o nspath.o
gmake[2]: Leaving directory `/var/tmp/ian/nagios-4.1.1/lib'
gcc -Wall -I.. -I /usr/include -I /usr/include/gd2 -DHAVE_CONFIG_H -DNSCORE -o nagios nagios.o ../common/shared.o nerd.o query-handler.o workers.o checks.o config.o commands.o events.o flapping.o logging.o macros-base.o netutils.o notifications.o sehandlers.o utils.o retention-base.o xretention-base.o comments-base.o xcomments-base.o objects-base.o xobjects-base.o statusdata-base.o xstatusdata-base.o perfdata-base.o xperfdata-base.o downtime-base.o -lm -lsocket -lsocket ../lib/libnagios.a
Undefined first referenced
symbol in file
neb_free_module_list nagios.o
neb_register_callback nerd.o
neb_deregister_callback nerd.o
neb_add_core_module nerd.o
ld: fatal: symbol referencing errors
collect2: error: ld returned 1 exit status
gmake[1]: *** [nagios] Error 1
gmake[1]: Leaving directory `/var/tmp/ian/nagios-4.1.1/base'
gmake: *** [all] Error 2
iscb2015
Posts: 13
Joined: Fri Aug 28, 2015 2:35 am

Re: repeated timed out error in the nagios log file

Post by iscb2015 »

Hi

I have raised a report at https://github.com/NagiosEnterprises/nagioscore/issues (number 84) for this issue - I presume that this will help in having somewhere to log time.

Cheers

Ian
Locked