Re: Perf Data stopped working
Posted: Mon Jan 11, 2016 1:18 pm
by CFT6Server
Code: Select all
# grep -v ^# /etc/mod_gearman/mod_gearman_neb.conf
debug=1
logfile=/var/log/mod_gearman/mod_gearman_neb.log
server=localhost:4730
eventhandler=yes
services=yes
hosts=yes
hostgroups=Network_ALL
servicegroups=ALL_Network_Bandwidth,WMI_CPU_Checks,WMI_IO_Checks,WMI_NETWORK_Checks
do_hostchecks=yes
route_eventhandler_like_checks=no
encryption=yes
key=somethinghere
use_uniq_jobs=on
localhostgroups=localhost
localservicegroups=
result_workers=1
perfdata=no
perfdata_mode=1
orphan_host_checks=yes
orphan_service_checks=yes
accept_clear_results=no
Code: Select all
# grep gearman /usr/local/nagios/etc/nagios.cfg
broker_module=/usr/lib64/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf eventhandler=no
Code: Select all
# grep -v ^# /etc/mod_gearman/mod_gearman_worker.conf
debug=1
logfile=/var/log/mod_gearman/mod_gearman_worker.log
server=<server>:4730
eventhandler=yes
services=no
hosts=no
hostgroups=Network_ALL
servicegroups=ALL_Network_Bandwidth
encryption=yes
key=<somethinghere>
job_timeout=120
min-worker=100
max-worker=500
idle-timeout=30
max-jobs=1000
spawn-rate=1
fork_on_exec=no
load_limit1=10
load_limit5=10
load_limit15=10
show_error_output=yes
enable_embedded_perl=on
use_embedded_perl_implicitly=off
use_perl_cache=on
p1_file=/usr/share/mod_gearman/mod_gearman_p1.pl
Re: Perf Data stopped working
Posted: Mon Jan 11, 2016 2:28 pm
by bheden
This may be helpful in eliminating your memory leak.
Please ensure that you have a backup of your server before you attempt the upgrade listed in the instructions below.
Code: Select all
cd /tmp
yum remove libgearman-devel libgearman gearmand mod_gearman
mkdir gearman_install
cd gearman_install/
wget http://mod-gearman.org/download/v2.1.1/rhel6/x86_64/gearmand-0.33-2.rhel6.x86_64.rpm
wget http://mod-gearman.org/download/v2.1.1/rhel6/x86_64/gearmand-devel-0.33-2.rhel6.x86_64.rpm
wget http://mod-gearman.org/download/v2.1.1/rhel6/x86_64/gearmand-server-0.33-2.rhel6.x86_64.rpm
wget http://mod-gearman.org/download/v2.1.1/rhel6/x86_64/mod_gearman2-2.1.1-1.rhel6.x86_64.rpm
yum --nogpgcheck localinstall *
sed -i 's/\(^broker_module=.*mod_gearman.*\)/#\1/' /usr/local/nagios/etc/nagios.cfg
echo "broker_module=/usr/lib64/mod_gearman2/mod_gearman2.o config=/etc/mod_gearman/mod_gearman_neb.conf eventhandler=no" >> /usr/local/nagios/etc/nagios.cfg
service nagios stop
service mod_gearman_worker stop
service gearmand stop
service gearmand start
service mod_gearman_worker start
service nagios start
Please inform us if this resolves your issue. Thank you.
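After the `sed` and `echo` steps above, nagios.cfg should contain exactly one active (uncommented) mod_gearman `broker_module` line. The following is a minimal sanity-check sketch, not part of the original instructions; the function name `count_active_gearman_brokers` is hypothetical and the path in the usage comment is the one used in this thread.

```shell
# Hypothetical helper: count uncommented broker_module lines that load mod_gearman.
# $1: path to nagios.cfg
count_active_gearman_brokers() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when the count is zero.
    grep -c '^broker_module=.*mod_gearman' "$1" || true
}

# Usage (path taken from the posts above):
#   count_active_gearman_brokers /usr/local/nagios/etc/nagios.cfg
```

A result other than 1 would suggest the old module line was not commented out, or that no new line was appended.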
Re: Perf Data stopped working
Posted: Thu Jan 28, 2016 1:09 pm
by CFT6Server
I have not applied this upgrade yet, but while I was away the performance graphs stopped working again. I rebooted this morning, but the graphs are still missing and I can't get them to show up.
Code: Select all
# tail -25 /usr/local/nagios/var/npcd.log
[01-28-2016 10:04:25] NPCD: ThreadCounter 4/5 File is 1454003856.perfdata.host
[01-28-2016 10:04:25] NPCD: Regular File: 1454003856.perfdata.host
[01-28-2016 10:04:25] NPCD: A thread was started on thread_counter = 4
[01-28-2016 10:04:25] NPCD: DEBUG: load 11.510000/20.000000
[01-28-2016 10:04:25] NPCD: ThreadCounter 5/5 File is 1454003856.perfdata.service
[01-28-2016 10:04:25] NPCD: Regular File: 1454003856.perfdata.service
[01-28-2016 10:04:25] NPCD: WARN: MAX Thread reached: 1454003856.perfdata.service comes later with ThreadCounter: 5
[01-28-2016 10:04:25] NPCD: DEBUG: Will wait for th['4']
[01-28-2016 10:04:25] NPCD: Processing file 1454003856.perfdata.host with ID 140518939485952 - going to exec /usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1454003856.perfdata.host
[01-28-2016 10:04:25] NPCD: Processing file '1454003856.perfdata.host'
[01-28-2016 10:04:43] NPCD: DEBUG: Will wait for th['3']
[01-28-2016 10:05:10] NPCD: ERROR: Executed command exits with return code '7'
[01-28-2016 10:05:10] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1454003826.perfdata.service'
[01-28-2016 10:05:14] NPCD: ERROR: Executed command exits with return code '7'
[01-28-2016 10:05:14] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1454003842.perfdata.service'
[01-28-2016 10:05:16] NPCD: DEBUG: Will wait for th['2']
[01-28-2016 10:05:16] NPCD: DEBUG: Will wait for th['1']
[01-28-2016 10:05:16] NPCD: DEBUG: Will wait for th['0']
[01-28-2016 10:05:16] NPCD: DEBUG: load 7.470000/20.000000
[01-28-2016 10:05:16] NPCD: ThreadCounter 0/5 File is 1454003856.perfdata.service
[01-28-2016 10:05:16] NPCD: Regular File: 1454003856.perfdata.service
[01-28-2016 10:05:16] NPCD: A thread was started on thread_counter = 0
[01-28-2016 10:05:16] NPCD: Have to wait: Filecounter = 46 - thread_counter = 1
[01-28-2016 10:05:16] NPCD: Processing file 1454003856.perfdata.service with ID 140518970955520 - going to exec /usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1454003856.perfdata.service
[01-28-2016 10:05:16] NPCD: Processing file '1454003856.perfdata.service'
Code: Select all
# tail -25 /usr/local/nagios/var/perfdata.log
2016-01-28 10:05:47 [6831] [1] Found Performance Data for L2E-LAN-B02 / CPU_Busy_5_Sec ('CPU Busy 5 Sec'=1;80;90;)
2016-01-28 10:05:47 [6831] [2] No Custom Template found for check_snmp (/usr/local/nagios/etc/pnp/check_commands/check_snmp.cfg)
2016-01-28 10:05:47 [6831] [2] Template is check_snmp.php
2016-01-28 10:05:47 [6831] [2] data2rrd called
2016-01-28 10:05:47 [6831] [2] RRDs::update /usr/local/nagios/share/perfdata/L2E-LAN-B02/CPU_Busy_5_Sec.rrd 1454003842:1
2016-01-28 10:05:47 [6831] [2] /usr/local/nagios/share/perfdata/L2E-LAN-B02/CPU_Busy_5_Sec.rrd updated
2016-01-28 10:05:47 [6831] [2] Processing Line 77
2016-01-28 10:05:47 [6831] [2] Datatype set to 'SERVICEPERFDATA'
2016-01-28 10:05:47 [6831] [1] Found Performance Data for L2E-LAN-B01 / VPC-3_to_KDCNBUFLT-SW-01_Po3_Port_Channel_Bandwidth (in=.000005Gb/s;7;8 out=.000767Gb/s;7;8)
2016-01-28 10:05:47 [6831] [2] No Custom Template found for check_xi_service_mrtgtraf (/usr/local/nagios/etc/pnp/check_commands/check_xi_service_mrtgtraf.cfg)
2016-01-28 10:05:47 [6831] [2] Template is check_xi_service_mrtgtraf.php
2016-01-28 10:05:47 [6831] [2] No Custom Template found for check_xi_service_mrtgtraf (/usr/local/nagios/etc/pnp/check_commands/check_xi_service_mrtgtraf.cfg)
2016-01-28 10:05:47 [6831] [2] Template is check_xi_service_mrtgtraf.php
2016-01-28 10:05:47 [6831] [2] data2rrd called
2016-01-28 10:05:47 [6831] [2] RRDs::update /usr/local/nagios/share/perfdata/L2E-LAN-B01/VPC-3_to_KDCNBUFLT-SW-01_Po3_Port_Channel_Bandwidth.rrd 1454003842:.000005:.000767
2016-01-28 10:05:47 [6831] [2] /usr/local/nagios/share/perfdata/L2E-LAN-B01/VPC-3_to_KDCNBUFLT-SW-01_Po3_Port_Channel_Bandwidth.rrd updated
2016-01-28 10:05:47 [6831] [2] Processing Line 78
2016-01-28 10:05:47 [6831] [2] Datatype set to 'SERVICEPERFDATA'
2016-01-28 10:05:47 [6831] [1] Found Performance Data for L2E-LAN-B01 / e2_2_UCS_Domain__1_Interconnect_A_1_18_Bandwidth (in=.032351Gb/s;7;8 out=.027169Gb/s;7;8)
2016-01-28 10:05:47 [6831] [2] No Custom Template found for check_xi_service_mrtgtraf (/usr/local/nagios/etc/pnp/check_commands/check_xi_service_mrtgtraf.cfg)
2016-01-28 10:05:47 [6831] [2] Template is check_xi_service_mrtgtraf.php
2016-01-28 10:05:47 [6831] [2] No Custom Template found for check_xi_service_mrtgtraf (/usr/local/nagios/etc/pnp/check_commands/check_xi_service_mrtgtraf.cfg)
2016-01-28 10:05:47 [6831] [2] Template is check_xi_service_mrtgtraf.php
2016-01-28 10:05:47 [6831] [2] data2rrd called
2016-01-28 10:05:47 [6831] [2] RRDs::update /usr/local/nagios/share/perfdata/L2E-LAN-B01/e2_2_UCS_Domain__1_Interconnect_A_1_18_Bandwidth.rrd 1454003842:.032351:.027169
Code: Select all
# free -m
             total       used       free     shared    buffers     cached
Mem:         15950       6375       9574         27        104       4565
-/+ buffers/cache:       1705      14244
Swap:         2015          0       2015
Code: Select all
top - 10:09:48 up 1:09, 1 user, load average: 5.21, 7.13, 6.31
Tasks: 338 total, 1 running, 337 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.5%us, 4.2%sy, 0.0%ni, 70.9%id, 12.3%wa, 0.1%hi, 1.0%si, 0.0%st
Mem: 16333268k total, 6565848k used, 9767420k free, 106868k buffers
Swap: 2064380k total, 0k used, 2064380k free, 4717636k cached
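For readers following the npcd.log excerpt above, one way to see how often process_perfdata.pl is failing is to tally the "return code" errors per code. This is a rough sketch that assumes the log line format shown in this post; the function name `npcd_error_codes` is hypothetical.

```shell
# Hypothetical helper: count npcd "Executed command exits with return code" errors,
# grouped by return code. Prints one "<code> <count>" line per code.
# $1: path to npcd.log
npcd_error_codes() {
    # Splitting on single quotes puts the return code in field 2,
    # e.g. ... return code '7'  ->  $2 == "7"
    awk -F"'" '/ERROR: Executed command exits with return code/ { n[$2]++ }
               END { for (c in n) print c, n[c] }' "$1"
}

# Usage (path taken from the post above):
#   npcd_error_codes /usr/local/nagios/var/npcd.log
```

A count that keeps growing between runs would indicate that perfdata files are still failing to process rather than just being delayed.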
Re: Perf Data stopped working
Posted: Thu Jan 28, 2016 1:36 pm
by CFT6Server
Looks like some graphs are starting to show after an hour or so, but so far seems spotty.
Re: Perf Data stopped working
Posted: Thu Jan 28, 2016 3:07 pm
by scottwilkerson
Glad to hear they are coming back, albeit spotty.
I did notice in your previous post that you have a fairly high I/O wait time (12.3%wa in the top output). Do you have a RAM disk set up on this server?
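To keep an eye on that I/O wait figure, the `%wa` value can be pulled out of a `top` summary line. This is a minimal sketch that assumes the older procps "Cpu(s):" format shown earlier in this thread; the function name `iowait_from_top_line` is hypothetical.

```shell
# Hypothetical helper: extract the I/O wait percentage from a top "Cpu(s):" line.
# $1: the summary line, e.g. "Cpu(s): 11.5%us, ..., 12.3%wa, ..."
iowait_from_top_line() {
    # Capture the number immediately before "%wa"; print nothing if absent.
    printf '%s\n' "$1" | sed -n 's/.*[ ,]\([0-9.]*\)%wa.*/\1/p'
}

# Usage:
#   iowait_from_top_line "$(top -bn1 | grep '^Cpu(s)')"
```

To see whether the perfdata spool already sits on a RAM disk, `df -T /usr/local/nagios/var/spool/perfdata` should report the filesystem type (for example `tmpfs`), assuming GNU `df`.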
Re: Perf Data stopped working
Posted: Thu Jan 28, 2016 3:13 pm
by bheden
Since you haven't upgraded yet:
I rewrote the ModGearman install script to install/upgrade automatically and copy the necessary configuration files over.
This is still in testing, but has passed all of our internal tests so far.
If you'd like to give it a go (with the caveat of supplying feedback, of course), the URL is:
http://assets.nagios.com/downloads/nagi ... Install.sh
On your server:
Code: Select all
wget http://assets.nagios.com/downloads/nagiosxi/scripts/ModGearmanInstall.sh
chmod +x ModGearmanInstall.sh
./ModGearmanInstall.sh --server --upgrade
And then follow the prompts. Make sure you have a good backup before you start.

Re: Perf Data stopped working
Posted: Mon Feb 01, 2016 12:11 pm
by CFT6Server
I'll give this a try. We have multiple instances, and I am also seeing this issue on a much smaller implementation, where the performance graphs just stop working. Service checks are still all running fine.
In case this provides some clues: I was able to catch this early enough today on this server, and the data stopped just before 8. After I rebooted the server, the perf graphs look like below. So perhaps that data is stuck?
I am also using Box293's tool to review the data to confirm that it stopped before 8am.
Re: Perf Data stopped working
Posted: Mon Feb 01, 2016 4:18 pm
by rkennedy
Are you using a ramdisk on this implementation?
Data can stop when memory fills up and the perfdata can no longer be processed. After each reboot, does the data always come back over time?
Let us know how the updated gearman goes.
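One way to tell whether data is "stuck" or merely delayed is to watch the number of perfdata files waiting in the spool directory. This is a rough sketch, assuming the spool path and the `<timestamp>.perfdata.host` / `<timestamp>.perfdata.service` file naming seen in the npcd.log output earlier in the thread; the function name `perfdata_backlog` is hypothetical.

```shell
# Hypothetical helper: count perfdata files still waiting in a spool directory.
# $1: spool directory
perfdata_backlog() {
    # Only look at regular files directly inside the directory;
    # tr strips the padding some wc implementations add.
    find "$1" -maxdepth 1 -type f -name '*.perfdata.*' | wc -l | tr -d ' '
}

# Usage (path taken from the npcd.log lines above):
#   perfdata_backlog /usr/local/nagios/var/spool/perfdata
```

If the count drops steadily after a reboot, npcd is working through a backlog; if it keeps climbing, processing is not keeping up.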