host check orphaned issue

bosecorp · Post by **bosecorp** » Fri Apr 22, 2016 11:50 am

Hi,

After restarting the Nagios service, we saw this message for almost all nodes:

(host check orphaned, is the mod-gearman worker on queue 'hostgroup_gearman_dcn3' running?)

Below is what we see in gearman_top:

2016-04-22 12:49:58  -  10.100.30.113:4730  -  v1.1.8

 Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
 check_results             |               4  |           0  |           0
 eventhandler              |              52  |           0  |           0
 host                      |              52  |           0  |           0
 hostgroup_gearman_dca1    |               1  |           0  |           0
 hostgroup_gearman_dce1    |              50  |           0  |           0
 hostgroup_gearman_dcn1    |              44  |           0  |           1
 hostgroup_gearman_dcn2    |              52  |           0  |           0
 hostgroup_gearman_dcn3    |              24  |           0  |           1
 hostgroup_gearman_hk1     |               1  |           7  |           1
 hostgroup_gearman_mi1     |               1  |           0  |           0
 hostgroup_gearman_my1     |               1  |           0  |           0
 hostgroup_gearman_sl1     |              24  |           0  |           0
 hostgroup_gearman_tj1     |              24  |           0  |           0
 hosts                     |               1  |           0  |           0
 service                   |              52  |           0  |           0
 servicegroup_gearman_mrtg |              52  |           0  |           0
 worker_gearmandce1        |               1  |           0  |           0
 worker_gearmandcn1        |               1  |           0  |           0
 worker_gearmandcn2        |               1  |           0  |           0
 worker_gearmandcn3        |               1  |           0  |           0
----------------------------------------------------------------------------

Nagios version : 5.2.7

Can someone assist us here?

tmcdonald · Post by **tmcdonald** » Fri Apr 22, 2016 1:55 pm

How long does it stay like that? If you restart in the middle of a check and the worker hasn't sent a result back yet, that could explain the message.
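One way to answer the "how long" question is to save the gearman_top table at intervals and flag queues that look starved. The sketch below is a hypothetical helper, not part of mod_gearman: `flag_queues` is an invented name, and it only assumes the pipe-separated table layout pasted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of mod_gearman): reads a gearman_top-style
# table on stdin and prints queues that look starved.
flag_queues() {
    awk -F'|' '
        # Data rows have four pipe-separated fields with a numeric second
        # field; the header and the dashed rules fail this test and are skipped.
        NF == 4 && $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            workers = $2 + 0
            waiting = $3 + 0
            if (workers == 0)
                print name ": no workers available"
            else if (waiting > 0)
                print name ": " waiting " waiting, " workers " worker(s)"
        }'
}
```

Feeding it a saved snapshot (`flag_queues < snapshot.txt`) should work. Piping gearman_top straight into it needs a batch/one-shot mode, which is an assumption here; check `gearman_top --help` on your version.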

bosecorp · Post by **bosecorp** » Mon Apr 25, 2016 8:33 am

The Nagios daemon has crashed multiple times in the past week, and we would like to understand the root cause.

Below is the analysis I have done so far. I can see messages related to "nagiosramdisk":

[Sat Apr 23 00:00:56 2016] Error: my_fcopy() failed to write to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to rename file '/usr/local/nagios/var/nagios.tmp37GkTh' to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to update status data file '/var/nagiosramdisk/status.dat': No space left on device

We also see a large number of messages about worker processes timing out:

[Sat Apr 23 19:27:57 2016] Warning: Host performance data file processing command '/bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/1461454059.perfdata.host' timed out after 5 seconds
[Sat Apr 23 19:27:57 2016] wproc: Core Worker 25361: job 1051 (pid=20693) timed out. Killing it
[Sat Apr 23 19:27:57 2016] wproc: Core Worker 25364: job 1047 (pid=20679) timed out. Killing it

grep Killing nagios-04-24-2016-00.log | wc -l
8048

I'm attaching the logs. Let me know what other information you need.
Please handle this request as high priority.

Thanks.
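The single `grep Killing ... | wc -l` total above does not show when the timeouts cluster. A rough sketch for breaking the count down per hour follows. `timeouts_per_hour` is a hypothetical name, and it assumes the human-readable timestamp form shown in this thread (`[Sat Apr 23 19:27:57 2016]`); a log with raw epoch timestamps would need converting first. The sort is only meaningful within a single month.

```shell
# Hypothetical helper: counts "timed out. Killing it" worker messages per hour.
# Assumes timestamps like "[Sat Apr 23 19:27:57 2016]" at the start of each line.
timeouts_per_hour() {
    grep -F 'timed out. Killing it' "$1" |
        awk '{
            split($4, t, ":")                  # field 4 is HH:MM:SS
            key = $2 " " $3 " " t[1] ":00"     # e.g. "Apr 23 19:00"
            count[key]++
        }
        END { for (k in count) print count[k], k }' |
        sort -k3,3n -k4,4                      # by day of month, then hour
}
```

For example, `timeouts_per_hour nagios-04-24-2016-00.log` would print one line per hour with the count first, which can then be compared against the time of the "No space left on device" errors.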

tmcdonald · Post by **tmcdonald** » Mon Apr 25, 2016 9:32 am

bosecorp wrote:[Sat Apr 23 00:00:56 2016] Error: my_fcopy() failed to write to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to rename file '/usr/local/nagios/var/nagios.tmp37GkTh' to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to update status data file '/var/nagiosramdisk/status.dat': No space left on device

I think this pretty clearly points to your ramdisk filling up. How big is it? I normally tell people to double it if they are seeing frequent fill-ups, especially if they are planning on also expanding their system.
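To catch a fill-up before Nagios hits "No space left on device", a small usage check along these lines could run from cron. This is only a sketch: `usage_pct` and `check_ramdisk` are hypothetical names, and the 80% limit is an arbitrary assumption.

```shell
# Hypothetical helper: extracts the Use% number from `df -P` output on stdin.
usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical check: warns when a mount point is above a percentage limit.
# The default path is the ramdisk from this thread; the 80% default limit is
# an assumption, not a recommended value.
check_ramdisk() {
    mount_point=${1:-/var/nagiosramdisk}
    limit=${2:-80}
    pct=$(df -P "$mount_point" | usage_pct)
    [ -n "$pct" ] || return 2
    if [ "$pct" -gt "$limit" ]; then
        echo "WARNING: $mount_point at ${pct}% (limit ${limit}%)"
        return 1
    fi
    echo "OK: $mount_point at ${pct}%"
}
```

If the ramdisk is a tmpfs mount, doubling it can usually be done online as root with something like `mount -o remount,size=4G /var/nagiosramdisk` (the 4G figure is only an example), with the matching change in `/etc/fstab` so it survives a reboot.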

Can you please also address my question from earlier?

tmcdonald wrote:How long does it stay like that? If you restart in the middle of a check and the worker hasn't sent a result back yet, that could explain the message.

bosecorp · Post by **bosecorp** » Mon Apr 25, 2016 10:58 am

Regarding /var/nagiosramdisk filling up: it is a virtual disk backed by RAM. Writing to memory instead of disk improves I/O performance.
The Nagios support team implemented this a few months ago.

The question is why we ran out of space.
The files created on this virtual disk are supposed to be deleted after some time. Can you please help us understand why we ran into this situation?

Let me know if you need any other information from our end.
Below is the current utilization.

# df -lh
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-lvroot        2.0G  1.1G  808M  58% /
tmpfs                             15G     0   15G   0% /dev/shm
/dev/sda1                        243M   49M  181M  22% /boot
/dev/mapper/rootvg-lvopt         2.0G   92M  1.8G   5% /opt
/dev/mapper/rootvg-lvtmp         6.9G  249M  6.3G   4% /tmp
/dev/mapper/rootvg-lvusers       4.0G  137M  3.7G   4% /users
/dev/mapper/rootvg-lvusr         7.9G  6.2G  1.4G  82% /usr
/dev/mapper/rootvg-lvvar          30G   12G   17G  43% /var
/dev/mapper/vgapp-lvapp           49G  4.1G   42G   9% /app
/dev/mapper/vgapp-lvstore         69G   37G   30G  56% /store
/dev/mapper/vgapp-lvlocalnagios  128G   91G   31G  75% /usr/local/nagios
/dev/mapper/vgapp-lvmysql         69G  2.6G   63G   4% /var/lib/mysql
/dev/mapper/vgapp-lvmodgearlog    20G  173M   19G   1% /var/log/mod_gearman
/dev/mapper/vgapp-lvgearlog       20G  174M   19G   1% /var/log/gearmand
tmpfs                            2.0G   52M  2.0G   3% /var/nagiosramdisk
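The `df` output only shows usage at the moment it was run, and the ramdisk is nearly empty here. To find out what fills it, it may help to log periodically which entries under the ramdisk are largest and how many files are waiting in the perfdata spool (the log above mentions `/var/nagiosramdisk/spool/xidpe/`). A minimal sketch, with hypothetical helper names:

```shell
# Hypothetical helper: lists the N largest entries directly under a directory,
# sizes in KiB, biggest first.
largest_entries() {
    dir=${1:-/var/nagiosramdisk}
    n=${2:-10}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}

# Hypothetical helper: counts regular files under a directory, for example a
# perfdata spool that is expected to be drained regularly.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}
```

Running `largest_entries /var/nagiosramdisk 5` and `count_files /var/nagiosramdisk/spool` every few minutes, with a timestamp, should show whether one directory keeps growing before the next "No space left on device" error.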

bosecorp · Post by **bosecorp** » Mon Apr 25, 2016 12:51 pm

We are facing the same issue again.

status info: (host check orphaned, is the mod-gearman worker on queue 'hostgroup_gearman_dcn1' running?)

This is happening for most of the Gearman queues.

Below is the current output of gearman_top:

2016-04-25 13:47:54  -  10.100.30.113:4730  -  v1.1.8

 Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
 check_results             |               4  |           0  |           1
 eventhandler              |              26  |           0  |           0
 host                      |              26  |           0  |           0
 hostgroup_gearman_dca1    |               1  |           0  |           0
 hostgroup_gearman_dce1    |              32  |           0  |          24
 hostgroup_gearman_dcn1    |              29  |           0  |          16
 hostgroup_gearman_dcn2    |              26  |           0  |           2
 hostgroup_gearman_dcn3    |              23  |           0  |           0
 hostgroup_gearman_hk1     |               1  |           0  |           0
 hostgroup_gearman_mi1     |               1  |           0  |           0
 hostgroup_gearman_my1     |               1  |           2  |           1
 hostgroup_gearman_sl1     |              23  |           0  |           0
 hostgroup_gearman_tj1     |              23  |           0  |           0
 hosts                     |               1  |           0  |           0
 service                   |              26  |           0  |           0
 servicegroup_gearman_mrtg |              26  |           0  |           0
 worker_gearmandce1        |               1  |           0  |           0
 worker_gearmandcn1        |               1  |           0  |           0
 worker_gearmandcn2        |               1  |           0  |           0
 worker_gearmandcn3        |               0  |           0  |           0
----------------------------------------------------------------------------

# df -lh
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-lvroot        2.0G  1.1G  808M  58% /
tmpfs                             15G     0   15G   0% /dev/shm
/dev/sda1                        243M   49M  181M  22% /boot
/dev/mapper/rootvg-lvopt         2.0G   92M  1.8G   5% /opt
/dev/mapper/rootvg-lvtmp         6.9G  249M  6.3G   4% /tmp
/dev/mapper/rootvg-lvusers       4.0G  137M  3.7G   4% /users
/dev/mapper/rootvg-lvusr         7.9G  6.2G  1.4G  83% /usr
/dev/mapper/rootvg-lvvar          30G   12G   17G  43% /var
/dev/mapper/vgapp-lvapp           49G  4.1G   42G   9% /app
/dev/mapper/vgapp-lvstore         69G   37G   30G  56% /store
/dev/mapper/vgapp-lvlocalnagios  128G   91G   31G  75% /usr/local/nagios
/dev/mapper/vgapp-lvmysql         69G  2.6G   63G   4% /var/lib/mysql
/dev/mapper/vgapp-lvmodgearlog    20G  173M   19G   1% /var/log/mod_gearman
/dev/mapper/vgapp-lvgearlog       20G  174M   19G   1% /var/log/gearmand
tmpfs                            2.0G   53M  2.0G   3% /var/nagiosramdisk

We need someone to troubleshoot this live and gather the required logs.

bosecorp · Post by **bosecorp** » Mon Apr 25, 2016 3:25 pm

We got in touch with Nagios phone support and have tracked down the root cause of this issue.

We will be monitoring our server for a few days now.

Post by **tgriep** » Mon Apr 25, 2016 3:41 pm

Let me know how it works for you.

bosecorp · Post by **bosecorp** » Tue Apr 26, 2016 1:44 pm

As per our telephone conversation yesterday with Tom Griep, he recommended that we implement the change below.

Replace the following:

killproc_nagios ()
{
    kill -s "$1" $NagiosPID
}

with:

killall -9 nagios

But after implementing this, I tried to restart the Nagios service and observed that it did not start.

Before implementing the change

# /etc/init.d/nagios restart
Running configuration check...
Stopping nagios:. done.
Starting nagios: done.

After the change

# /etc/init.d/nagios restart
Running configuration check...
Stopping nagios:Killed

Can you please let us know the correct way to restart the service?
I have already sent him an email.

Post by **tgriep** » Tue Apr 26, 2016 4:46 pm

I would put the following line back in /etc/init.d/nagios until we figure out a better way to do it.

kill -s "$1" $NagiosPID

I think the timeout settings we changed in that config file will resolve the issue.
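A likely reason the `killall -9 nagios` variant printed "Stopping nagios:Killed" and never reached the start step is that `killall` matches by process name, and the init script itself runs under the name `nagios`, so it probably killed itself along with the daemon. If a forced stop is still wanted, a middle ground is to signal only the recorded PID and escalate if it does not exit. This is a sketch under assumptions: `stop_pid` is a hypothetical function, not part of the stock init script, and the timeout values are arbitrary.

```shell
# Hypothetical alternative to a blanket `killall -9 nagios`: signal one PID,
# wait, and escalate to SIGKILL only if it is still alive after the timeout.
# Returns 0 if the process was gone after SIGTERM, 1 if SIGKILL was needed.
stop_pid() {
    pid=$1
    timeout=${2:-10}        # seconds to wait; the default of 10 is an assumption
    # If the TERM signal cannot be delivered, treat the process as already gone.
    kill -s TERM "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -s KILL "$pid" 2>/dev/null
    return 1
}
```

Inside the init script this could be called as `stop_pid "$NagiosPID" 30`, which leaves the script itself and any unrelated processes that happen to be named `nagios` alone.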
