host check orphaned issue

bosecorp
Posts: 929
Joined: Thu Jun 26, 2014 1:00 pm

host check orphaned issue

Post by bosecorp »

Hi,

After restarting the Nagios service, we saw this message for almost all the nodes:

(host check orphaned, is the mod-gearman worker on queue 'hostgroup_gearman_dcn3' running?)

Below is what we see in gearman_top

2016-04-22 12:49:58  -  10.100.30.113:4730  -  v1.1.8

 Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
 check_results             |               4  |           0  |           0
 eventhandler              |              52  |           0  |           0
 host                      |              52  |           0  |           0
 hostgroup_gearman_dca1    |               1  |           0  |           0
 hostgroup_gearman_dce1    |              50  |           0  |           0
 hostgroup_gearman_dcn1    |              44  |           0  |           1
 hostgroup_gearman_dcn2    |              52  |           0  |           0
 hostgroup_gearman_dcn3    |              24  |           0  |           1
 hostgroup_gearman_hk1     |               1  |           7  |           1
 hostgroup_gearman_mi1     |               1  |           0  |           0
 hostgroup_gearman_my1     |               1  |           0  |           0
 hostgroup_gearman_sl1     |              24  |           0  |           0
 hostgroup_gearman_tj1     |              24  |           0  |           0
 hosts                     |               1  |           0  |           0
 service                   |              52  |           0  |           0
 servicegroup_gearman_mrtg |              52  |           0  |           0
 worker_gearmandce1        |               1  |           0  |           0
 worker_gearmandcn1        |               1  |           0  |           0
 worker_gearmandcn2        |               1  |           0  |           0
 worker_gearmandcn3        |               1  |           0  |           0
----------------------------------------------------------------------------
Nagios version : 5.2.7

Can someone assist us here?
Last edited by tmcdonald on Fri Apr 22, 2016 1:54 pm, edited 1 time in total.
Reason: Please use [code][/code] tags around long output
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: host check orphaned issue

Post by tmcdonald »

How long does it stay like that? If you restart in the middle of a check and the worker hasn't sent a result back, that could explain the message.
Former Nagios employee
bosecorp
Posts: 929
Joined: Thu Jun 26, 2014 1:00 pm

Re: host check orphaned issue

Post by bosecorp »

The Nagios daemon has crashed multiple times in the past week, and we would like to understand the root cause.

Below is the analysis I have done so far. I can see messages related to "nagiosramdisk":

[Sat Apr 23 00:00:56 2016] Error: my_fcopy() failed to write to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to rename file '/usr/local/nagios/var/nagios.tmp37GkTh' to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to update status data file '/var/nagiosramdisk/status.dat': No space left on device

We also see a large number of messages about worker processes timing out:

[Sat Apr 23 19:27:57 2016] Warning: Host performance data file processing command '/bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/1461454059.perfdata.host' timed out after 5 seconds
[Sat Apr 23 19:27:57 2016] wproc: Core Worker 25361: job 1051 (pid=20693) timed out. Killing it
[Sat Apr 23 19:27:57 2016] wproc: Core Worker 25364: job 1047 (pid=20679) timed out. Killing it

grep Killing nagios-04-24-2016-00.log | wc -l
8048

I'm attaching the logs. Let me know what other information you need.
Can you please handle this request as high priority?

Thanks.
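
A per-hour breakdown of those kill messages may show whether the timeouts cluster around the ramdisk errors or are spread evenly through the day. The sketch below is illustrative only: the function name kills_per_hour is made up here, and it assumes every log line starts with the bracketed timestamp format shown in the excerpts above.

```shell
# Hypothetical helper (not part of Nagios): count "timed out. Killing it"
# messages per hour. Reads log lines on stdin and assumes the timestamp
# format "[Sat Apr 23 19:27:57 2016] ..." seen in the excerpts above.
kills_per_hour() {
    grep -F 'timed out. Killing it' |
    awk '{
        split($4, t, ":")                 # $4 is HH:MM:SS
        key = $2 " " $3 " " t[1] ":00"    # e.g. "Apr 23 19:00"
        n[key]++
    }
    END { for (k in n) print n[k], k }' |
    sort -k2                              # text sort on month/day/hour
}
```

It could be run as kills_per_hour < nagios-04-24-2016-00.log; each output line is a count followed by the month, day, and hour.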
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: host check orphaned issue

Post by tmcdonald »

bosecorp wrote:
[Sat Apr 23 00:00:56 2016] Error: my_fcopy() failed to write to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to rename file '/usr/local/nagios/var/nagios.tmp37GkTh' to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to update status data file '/var/nagiosramdisk/status.dat': No space left on device
I think this pretty clearly points to your ramdisk filling up. How big is it? I normally tell people to double it if they are seeing frequent fill-ups, especially if they are planning on also expanding their system.

Can you please also address my question from earlier?
tmcdonald wrote:
How long does it stay like that? If you restart in the middle of a check and the worker hasn't sent a result back, that could explain the message.
Former Nagios employee
bosecorp
Posts: 929
Joined: Thu Jun 26, 2014 1:00 pm

Re: host check orphaned issue

Post by bosecorp »

Regarding /var/nagiosramdisk filling up: it is a virtual disk backed by RAM. The point is to improve I/O performance by writing to memory instead of to disk.
The Nagios support team implemented this a few months ago.

The question here is why we ran out of space.
The files created on this virtual disk are supposed to be deleted after some time. Can you please help us understand why we ran into this scenario?

Let me know if you need any other information from our end.
Below is the current utilization.

# df -lh
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-lvroot        2.0G  1.1G  808M  58% /
tmpfs                             15G     0   15G   0% /dev/shm
/dev/sda1                        243M   49M  181M  22% /boot
/dev/mapper/rootvg-lvopt         2.0G   92M  1.8G   5% /opt
/dev/mapper/rootvg-lvtmp         6.9G  249M  6.3G   4% /tmp
/dev/mapper/rootvg-lvusers       4.0G  137M  3.7G   4% /users
/dev/mapper/rootvg-lvusr         7.9G  6.2G  1.4G  82% /usr
/dev/mapper/rootvg-lvvar          30G   12G   17G  43% /var
/dev/mapper/vgapp-lvapp           49G  4.1G   42G   9% /app
/dev/mapper/vgapp-lvstore         69G   37G   30G  56% /store
/dev/mapper/vgapp-lvlocalnagios  128G   91G   31G  75% /usr/local/nagios
/dev/mapper/vgapp-lvmysql         69G  2.6G   63G   4% /var/lib/mysql
/dev/mapper/vgapp-lvmodgearlog    20G  173M   19G   1% /var/log/mod_gearman
/dev/mapper/vgapp-lvgearlog       20G  174M   19G   1% /var/log/gearmand
tmpfs                            2.0G   52M  2.0G   3% /var/nagiosramdisk
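
This df snapshot shows the ramdisk at 3%, so the fill-up seems to be transient and a one-off check after the fact will not capture it. Logging usage at short intervals is one way to catch it as it happens. A minimal sketch follows; the function names and any threshold passed in are assumptions for illustration, not anything shipped with Nagios XI. Since tmpfs can also report "No space left on device" when it runs out of inodes, it may be worth looking at df -i as well.

```shell
# Hypothetical helpers, not part of Nagios XI.

# mount_use_pct MOUNTPOINT
# Reads df-style output on stdin and prints the Use% value (digits only)
# for the row whose last column equals MOUNTPOINT.
mount_use_pct() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { sub(/%/, "", $5); print $5 }'
}

# warn_if_full MOUNTPOINT THRESHOLD_PCT
# Prints a timestamped warning when usage is at or above the threshold.
warn_if_full() {
    pct=$(df -P "$1" | mount_use_pct "$1")
    if [ -n "$pct" ] && [ "$pct" -ge "$2" ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') WARNING: $1 at ${pct}% (threshold $2%)"
    fi
}
```

For example, calling warn_if_full /var/nagiosramdisk 80 from cron every minute and appending the output to a file would give a timeline to compare against the Nagios log. The 80 here is an arbitrary example value.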
bosecorp
Posts: 929
Joined: Thu Jun 26, 2014 1:00 pm

Re: host check orphaned issue

Post by bosecorp »

We are facing the same issue again.

Status info: (host check orphaned, is the mod-gearman worker on queue 'hostgroup_gearman_dcn1' running?)

This is happening for most of the gearman queues.

Below is the current output of gearman_top:

2016-04-25 13:47:54  -  10.100.30.113:4730  -  v1.1.8

 Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
 check_results             |               4  |           0  |           1
 eventhandler              |              26  |           0  |           0
 host                      |              26  |           0  |           0
 hostgroup_gearman_dca1    |               1  |           0  |           0
 hostgroup_gearman_dce1    |              32  |           0  |          24
 hostgroup_gearman_dcn1    |              29  |           0  |          16
 hostgroup_gearman_dcn2    |              26  |           0  |           2
 hostgroup_gearman_dcn3    |              23  |           0  |           0
 hostgroup_gearman_hk1     |               1  |           0  |           0
 hostgroup_gearman_mi1     |               1  |           0  |           0
 hostgroup_gearman_my1     |               1  |           2  |           1
 hostgroup_gearman_sl1     |              23  |           0  |           0
 hostgroup_gearman_tj1     |              23  |           0  |           0
 hosts                     |               1  |           0  |           0
 service                   |              26  |           0  |           0
 servicegroup_gearman_mrtg |              26  |           0  |           0
 worker_gearmandce1        |               1  |           0  |           0
 worker_gearmandcn1        |               1  |           0  |           0
 worker_gearmandcn2        |               1  |           0  |           0
 worker_gearmandcn3        |               0  |           0  |           0
----------------------------------------------------------------------------
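
In this listing worker_gearmandcn3 shows 0 available workers. A queue with no workers is easy to miss in a long table, so a small filter may help when comparing snapshots. The sketch below reads a captured gearman_top table on stdin; the function name is hypothetical and the column layout is assumed to match the output above.

```shell
# Hypothetical helper: list queues that have no available worker.
# Reads a gearman_top table on stdin, with columns
#   Queue Name | Worker Available | Jobs Waiting | Jobs Running
queues_without_workers() {
    awk -F'|' '
        NF == 4 && $2 ~ /^ *[0-9]+ *$/ {   # data rows only, skips the header
            name = $1
            gsub(/ /, "", name)            # strip column padding
            if ($2 + 0 == 0) print name
        }'
}
```

Fed the table above, it prints the name of each queue whose Worker Available column is zero, one per line.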

# df -lh
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-lvroot        2.0G  1.1G  808M  58% /
tmpfs                             15G     0   15G   0% /dev/shm
/dev/sda1                        243M   49M  181M  22% /boot
/dev/mapper/rootvg-lvopt         2.0G   92M  1.8G   5% /opt
/dev/mapper/rootvg-lvtmp         6.9G  249M  6.3G   4% /tmp
/dev/mapper/rootvg-lvusers       4.0G  137M  3.7G   4% /users
/dev/mapper/rootvg-lvusr         7.9G  6.2G  1.4G  83% /usr
/dev/mapper/rootvg-lvvar          30G   12G   17G  43% /var
/dev/mapper/vgapp-lvapp           49G  4.1G   42G   9% /app
/dev/mapper/vgapp-lvstore         69G   37G   30G  56% /store
/dev/mapper/vgapp-lvlocalnagios  128G   91G   31G  75% /usr/local/nagios
/dev/mapper/vgapp-lvmysql         69G  2.6G   63G   4% /var/lib/mysql
/dev/mapper/vgapp-lvmodgearlog    20G  173M   19G   1% /var/log/mod_gearman
/dev/mapper/vgapp-lvgearlog       20G  174M   19G   1% /var/log/gearmand
tmpfs                            2.0G   53M  2.0G   3% /var/nagiosramdisk

I need someone to do live troubleshooting and gather the required logs.
bosecorp
Posts: 929
Joined: Thu Jun 26, 2014 1:00 pm

Re: host check orphaned issue

Post by bosecorp »

We did get someone from Nagios phone support, and we have tracked down the root cause of this issue.

We will be monitoring our server for a few days now.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: host check orphaned issue

Post by tgriep »

Let me know how it works for you.
bosecorp
Posts: 929
Joined: Thu Jun 26, 2014 1:00 pm

Re: host check orphaned issue

Post by bosecorp »

In our phone conversation yesterday, Tom Griep recommended that we make the change below.

Replace this:

killproc_nagios ()
{
    kill -s "$1" $NagiosPID
}

with this:

killall -9 nagios

After making this change, I tried to restart the Nagios service, but it did not start.

Before implementing the change

# /etc/init.d/nagios restart
Running configuration check...
Stopping nagios:. done.
Starting nagios: done.

After the change

# /etc/init.d/nagios restart
Running configuration check...
Stopping nagios:Killed

Can you please let us know the correct way to restart the service?
I have already sent him an email.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: host check orphaned issue

Post by tgriep »

I would put the following line back in /etc/init.d/nagios until we figure out a better way to do it.

kill -s "$1" $NagiosPID
I am thinking that the timeout settings we did in that config file will resolve the issue.
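
A likely reason the restart died with "Killed" is that killall matches processes by name, and the init script /etc/init.d/nagios itself runs under the name nagios, so killall -9 nagios can kill the script before it reaches the start step. A common alternative is to keep signalling by PID and only escalate to SIGKILL after a grace period. The sketch below shows that pattern in general terms; stop_by_pid and its timeout argument are hypothetical, and this is not the stock Nagios init script.

```shell
# Hypothetical helper, not taken from the Nagios init script.
# stop_by_pid PID TIMEOUT_SECONDS
# Sends SIGTERM, waits up to TIMEOUT_SECONDS for the process to exit,
# then falls back to SIGKILL for that one PID only.
stop_by_pid() {
    pid=$1
    remaining=$2
    kill -s TERM "$pid" 2>/dev/null || return 0   # nothing to stop
    while [ "$remaining" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0    # process has exited
        sleep 1
        remaining=$((remaining - 1))
    done
    kill -s KILL "$pid" 2>/dev/null               # last resort
    return 0
}
```

Inside the init script this might be called as stop_by_pid "$NagiosPID" 30, where 30 seconds is an arbitrary example value.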