This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
bosecorp
Posts: 929 Joined: Thu Jun 26, 2014 1:00 pm
by bosecorp » Fri Apr 22, 2016 11:50 am
Hi,
After restarting the Nagios service, we saw this message for almost all the nodes:
(host check orphaned, is the mod-gearman worker on queue 'hostgroup_gearman_dcn3' running?)
Below is what we see in gearman_top:
2016-04-22 12:49:58 - 10.100.30.113:4730 - v1.1.8
Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
check_results             |                4 |            0 |            0
eventhandler              |               52 |            0 |            0
host                      |               52 |            0 |            0
hostgroup_gearman_dca1    |                1 |            0 |            0
hostgroup_gearman_dce1    |               50 |            0 |            0
hostgroup_gearman_dcn1    |               44 |            0 |            1
hostgroup_gearman_dcn2    |               52 |            0 |            0
hostgroup_gearman_dcn3    |               24 |            0 |            1
hostgroup_gearman_hk1     |                1 |            7 |            1
hostgroup_gearman_mi1     |                1 |            0 |            0
hostgroup_gearman_my1     |                1 |            0 |            0
hostgroup_gearman_sl1     |               24 |            0 |            0
hostgroup_gearman_tj1     |               24 |            0 |            0
hosts                     |                1 |            0 |            0
service                   |               52 |            0 |            0
servicegroup_gearman_mrtg |               52 |            0 |            0
worker_gearmandce1        |                1 |            0 |            0
worker_gearmandcn1        |                1 |            0 |            0
worker_gearmandcn2        |                1 |            0 |            0
worker_gearmandcn3        |                1 |            0 |            0
----------------------------------------------------------------------------
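In the output above, hostgroup_gearman_hk1 has 7 jobs waiting behind a single worker. When the table is long, a small filter can make rows like that stand out. The following is only a sketch: `flag_queues` is a hypothetical helper name, and it assumes the pipe-separated layout shown in this post, saved to a file or piped in.

```shell
#!/bin/sh
# flag_queues (hypothetical helper): read gearman_top-style rows
# ("name | workers | waiting | running") on stdin and print every
# queue that has no workers or has jobs waiting.
flag_queues() {
    awk -F'|' '
        # Only data rows: four fields and a numeric worker count.
        NF == 4 && $2 ~ /^[ ]*[0-9]+[ ]*$/ {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the queue name
            workers = $2 + 0
            waiting = $3 + 0
            if (workers == 0)
                print name ": no workers available"
            else if (waiting > 0)
                print name ": " waiting " jobs waiting, " workers " workers"
        }
    '
}

# Example (gearman_top.txt is an assumed file holding saved output):
#   flag_queues < gearman_top.txt
```

The header, timestamp, and separator lines are skipped because they do not have a numeric second column.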
Nagios version: 5.2.7
Can someone assist us here?
Last edited by tmcdonald on Fri Apr 22, 2016 1:54 pm, edited 1 time in total.
Reason: Please use [code][/code] tags around long output
tmcdonald
Posts: 9117 Joined: Mon Sep 23, 2013 8:40 am
by tmcdonald » Fri Apr 22, 2016 1:55 pm
How long does it stay like that? If you restart in the middle of a check and the worker hasn't sent a result back, that could explain the message.
Former Nagios employee
by bosecorp » Mon Apr 25, 2016 8:33 am
The Nagios daemon has crashed multiple times in the past week, and we would like to understand the root cause.
Below is the analysis I have done so far. I can see messages related to "nagiosramdisk":
[Sat Apr 23 00:00:56 2016] Error: my_fcopy() failed to write to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to rename file '/usr/local/nagios/var/nagios.tmp37GkTh' to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to update status data file '/var/nagiosramdisk/status.dat': No space left on device
We also see a large number of messages about worker processes timing out:
[Sat Apr 23 19:27:57 2016] Warning: Host performance data file processing command '/bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/1461454059.perfdata.host' timed out after 5 seconds
[Sat Apr 23 19:27:57 2016] wproc: Core Worker 25361: job 1051 (pid=20693) timed out. Killing it
[Sat Apr 23 19:27:57 2016] wproc: Core Worker 25364: job 1047 (pid=20679) timed out. Killing it
grep Killing nagios-04-24-2016-00.log | wc -l
8048
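A single total of 8048 does not show whether the kills are spread evenly or cluster around a particular time. One way to look at that is to bucket the "Killing it" lines by hour. This is a sketch under assumptions: `kills_per_hour` is a hypothetical name, and it expects the human-readable timestamp format quoted above (a log file with raw epoch timestamps would need converting first).

```shell
#!/bin/sh
# kills_per_hour (hypothetical helper): count "timed out. Killing it"
# lines per hour. Assumes timestamps like "[Sat Apr 23 19:27:57 2016]".
kills_per_hour() {
    awk '/timed out\. Killing it/ {
             split($4, t, ":")                 # $4 is HH:MM:SS
             count[$2 " " $3 " " t[1] ":00"]++ # key: "Apr 23 19:00"
         }
         END { for (k in count) print k, count[k] }' "$1" | sort
}

# Example, using the file name from the grep above:
#   kills_per_hour nagios-04-24-2016-00.log
```

Each output line is the month, day, and hour followed by the number of kills in that hour.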
I'm attaching the logs. Let me know what other information you need.
Can you please treat this request as high priority?
Thanks.
by tmcdonald » Mon Apr 25, 2016 9:32 am
bosecorp wrote: [Sat Apr 23 00:00:56 2016] Error: my_fcopy() failed to write to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to rename file '/usr/local/nagios/var/nagios.tmp37GkTh' to '/var/nagiosramdisk/status.dat': No space left on device
[Sat Apr 23 00:00:56 2016] Error: Unable to update status data file '/var/nagiosramdisk/status.dat': No space left on device
I think this pretty clearly points to your ramdisk filling up. How big is it? I normally tell people to double it if they are seeing frequent fill-ups, especially if they are planning on also expanding their system.
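For reference, doubling a tmpfs mount can be sketched as below. This is an illustration only: `double_tmpfs_cmd` is a hypothetical helper that prints a remount command rather than running it, and it assumes the ramdisk is a plain tmpfs mount. A live remount does not survive a reboot, so wherever the mount is defined would need the same size change.

```shell
#!/bin/sh
# double_tmpfs_cmd (hypothetical helper): print, but do not run, a
# remount command that doubles the current size of a tmpfs mount.
double_tmpfs_cmd() {
    mnt=$1
    # df -Pk reports the total size in 1024-byte blocks in column 2.
    kb=$(df -Pk "$mnt" | awk 'NR == 2 { print $2 }')
    echo "mount -o remount,size=$((kb * 2))k $mnt"
}

# Example: review the printed command, then run it as root if it looks right.
#   double_tmpfs_cmd /var/nagiosramdisk
```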
Can you please also address my question from earlier?
tmcdonald wrote: How long does it stay like that? If you restart in the middle of a check and the worker hasn't sent a result back, that could explain the message.
by bosecorp » Mon Apr 25, 2016 10:58 am
Regarding /var/nagiosramdisk filling up: it's a virtual disk backed by RAM. The reason for using it is to improve I/O performance by writing to memory instead of to disk.
This is something the Nagios support team implemented a few months ago.
The question here is why we ran out of space.
The files created on this virtual disk are supposed to be deleted after some time. Can you please help us understand why we ran into this scenario?
Let me know if you need any other information from our end.
Below is the current utilization:
# df -lh
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-lvroot        2.0G  1.1G  808M  58% /
tmpfs                             15G     0   15G   0% /dev/shm
/dev/sda1                        243M   49M  181M  22% /boot
/dev/mapper/rootvg-lvopt         2.0G   92M  1.8G   5% /opt
/dev/mapper/rootvg-lvtmp         6.9G  249M  6.3G   4% /tmp
/dev/mapper/rootvg-lvusers       4.0G  137M  3.7G   4% /users
/dev/mapper/rootvg-lvusr         7.9G  6.2G  1.4G  82% /usr
/dev/mapper/rootvg-lvvar          30G   12G   17G  43% /var
/dev/mapper/vgapp-lvapp           49G  4.1G   42G   9% /app
/dev/mapper/vgapp-lvstore         69G   37G   30G  56% /store
/dev/mapper/vgapp-lvlocalnagios  128G   91G   31G  75% /usr/local/nagios
/dev/mapper/vgapp-lvmysql         69G  2.6G   63G   4% /var/lib/mysql
/dev/mapper/vgapp-lvmodgearlog    20G  173M   19G   1% /var/log/mod_gearman
/dev/mapper/vgapp-lvgearlog       20G  174M   19G   1% /var/log/gearmand
tmpfs                            2.0G   52M  2.0G   3% /var/nagiosramdisk
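The ramdisk shows 3% used here, yet the log reported "No space left on device" at 00:00:56 on Apr 23, so the usage appears to spike and then drop again. Sampling it regularly could help catch what fills it. The sketch below is one way to do that; `usage_pct` and `check_ramdisk` are hypothetical names, and the threshold and scheduling (for example from cron) are left as assumptions.

```shell
#!/bin/sh
# usage_pct (hypothetical helper): print the Use% of the filesystem
# holding PATH as a bare integer, using POSIX df output.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_ramdisk (hypothetical helper): log one timestamped usage line;
# when usage reaches LIMIT percent, also list the ten largest entries.
check_ramdisk() {
    path=$1
    limit=$2
    pct=$(usage_pct "$path")
    echo "$(date '+%F %T') $path ${pct}% used"
    if [ "$pct" -ge "$limit" ]; then
        du -ak "$path" 2>/dev/null | sort -rn | head -n 10
    fi
}

# Example: append a sample to a log file (the log path is an assumption).
#   check_ramdisk /var/nagiosramdisk 80 >> /tmp/ramdisk-usage.log
```

Run every minute or so, the log would show when the spike happens and which files are largest at that moment.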
by bosecorp » Mon Apr 25, 2016 12:51 pm
We are facing the same issue again.
Status info: (host check orphaned, is the mod-gearman worker on queue 'hostgroup_gearman_dcn1' running?)
This is happening for most of the gearman queues.
Below is the current output of gearman_top:
2016-04-25 13:47:54 - 10.100.30.113:4730 - v1.1.8
Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
check_results             |                4 |            0 |            1
eventhandler              |               26 |            0 |            0
host                      |               26 |            0 |            0
hostgroup_gearman_dca1    |                1 |            0 |            0
hostgroup_gearman_dce1    |               32 |            0 |           24
hostgroup_gearman_dcn1    |               29 |            0 |           16
hostgroup_gearman_dcn2    |               26 |            0 |            2
hostgroup_gearman_dcn3    |               23 |            0 |            0
hostgroup_gearman_hk1     |                1 |            0 |            0
hostgroup_gearman_mi1     |                1 |            0 |            0
hostgroup_gearman_my1     |                1 |            2 |            1
hostgroup_gearman_sl1     |               23 |            0 |            0
hostgroup_gearman_tj1     |               23 |            0 |            0
hosts                     |                1 |            0 |            0
service                   |               26 |            0 |            0
servicegroup_gearman_mrtg |               26 |            0 |            0
worker_gearmandce1        |                1 |            0 |            0
worker_gearmandcn1        |                1 |            0 |            0
worker_gearmandcn2        |                1 |            0 |            0
worker_gearmandcn3        |                0 |            0 |            0
----------------------------------------------------------------------------
# df -lh
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-lvroot        2.0G  1.1G  808M  58% /
tmpfs                             15G     0   15G   0% /dev/shm
/dev/sda1                        243M   49M  181M  22% /boot
/dev/mapper/rootvg-lvopt         2.0G   92M  1.8G   5% /opt
/dev/mapper/rootvg-lvtmp         6.9G  249M  6.3G   4% /tmp
/dev/mapper/rootvg-lvusers       4.0G  137M  3.7G   4% /users
/dev/mapper/rootvg-lvusr         7.9G  6.2G  1.4G  83% /usr
/dev/mapper/rootvg-lvvar          30G   12G   17G  43% /var
/dev/mapper/vgapp-lvapp           49G  4.1G   42G   9% /app
/dev/mapper/vgapp-lvstore         69G   37G   30G  56% /store
/dev/mapper/vgapp-lvlocalnagios  128G   91G   31G  75% /usr/local/nagios
/dev/mapper/vgapp-lvmysql         69G  2.6G   63G   4% /var/lib/mysql
/dev/mapper/vgapp-lvmodgearlog    20G  173M   19G   1% /var/log/mod_gearman
/dev/mapper/vgapp-lvgearlog       20G  174M   19G   1% /var/log/gearmand
tmpfs                            2.0G   53M  2.0G   3% /var/nagiosramdisk
I need someone to do live troubleshooting with us and gather the required logs.
by bosecorp » Mon Apr 25, 2016 3:25 pm
We did get someone from Nagios phone support and have tracked down the root cause of this issue.
We will be monitoring our server for the next few days.
tgriep
Madmin
Posts: 9190 Joined: Thu Oct 30, 2014 9:02 am
by tgriep » Mon Apr 25, 2016 3:41 pm
Let me know how it works for you.
Be sure to check out our
Knowledgebase for helpful articles and solutions!
by bosecorp » Tue Apr 26, 2016 1:44 pm
In our phone conversation yesterday, Tom Griep recommended that we implement the change below.
Replace the following:
killproc_nagios ()
{
kill -s "$1" $NagiosPID
}
with:
killall -9 nagios
After implementing this, I tried to restart the Nagios service and observed that it did not start.
Before implementing the change:
# /etc/init.d/nagios restart
Running configuration check...
Stopping nagios:. done.
Starting nagios: done.
After the change:
# /etc/init.d/nagios restart
Running configuration check...
Stopping nagios:Killed
Can you please let us know the correct way to restart the service?
I have already sent him an email.
by tgriep » Tue Apr 26, 2016 4:46 pm
I would put the original line back in /etc/init.d/nagios until we figure out a better way to do it.
I think the timeout settings we changed in that config file will resolve the issue.
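One possible explanation for the "Stopping nagios:Killed" output, not confirmed in this thread, is that `killall` matches by process name, and the init script at /etc/init.d/nagios is itself named nagios, so it may have been killed along with the daemon before it reached the start step. A PID-based stop that escalates to SIGKILL only after a timeout avoids name matching altogether. The sketch below is not the stock init script: `stop_by_pid` and the 10-second default are assumptions for illustration.

```shell
#!/bin/sh
# stop_by_pid (hypothetical helper): send TERM to one PID, wait up to
# TIMEOUT seconds for it to exit, then fall back to KILL for that PID only.
stop_by_pid() {
    pid=$1
    timeout=${2:-10}
    kill -s TERM "$pid" 2>/dev/null || return 0   # already gone
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0    # exited after TERM
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -s KILL "$pid" 2>/dev/null || true       # last resort
    return 0
}

# Example, reusing the $NagiosPID variable from the init script:
#   stop_by_pid "$NagiosPID" 30
```

Because only the given PID is signalled, the script doing the stopping cannot be caught by its own kill.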