Re: Issues with graphs caused by nagios process memory leak
Posted: Wed Jan 06, 2016 3:35 pm
by WillemDH
Sean,
Yes I am running gearman. All our mrtg related services are running on a mod gearman worker node.
Code: Select all
rpm -qa | grep gearman
libgearman-1.1.8-2.el6.x86_64
gearmand-1.1.8-2.el6.x86_64
mod_gearman-1.5.0b1-1.el6.x86_64
Code: Select all
grep -v ^# /etc/mod_gearman/mod_gearman_worker.conf | sort -u
debug=0
enable_embedded_perl=on
encryption=yes
eventhandler=yes
fork_on_exec=no
hosts=yes
idle-timeout=30
job_timeout=60
key=pjJeLzIJkFgRyUNxgocg
load_limit1=0
load_limit15=0
load_limit5=0
logfile=/var/log/mod_gearman/mod_gearman_worker.log
max-jobs=1000
max-worker=50
min-worker=5
p1_file=/usr/share/mod_gearman/mod_gearman_p1.pl
server=127.0.0.1:4730
services=yes
show_error_output=yes
spawn-rate=1
use_embedded_perl_implicitly=off
use_perl_cache=on
workaround_rc_25=off
/etc/mod_gearman2/worker.conf does not exist.
Code: Select all
grep -v ^# /etc/mod_gearman2/worker.conf | sort -u
grep: /etc/mod_gearman2/worker.conf: No such file or directory
You have new mail in /var/spool/mail/root
[root@srvnagios01 ~]# locate worker.conf
/etc/mod_gearman/mod_gearman_worker.conf
/usr/share/mod_gearman/standalone_worker.conf
Code: Select all
grep "Core Worker" /usr/local/nagios/var/nagios.log
[1452039488] wproc: Core Worker 941: job 69756 (pid=50579) timed out. Killing it
[1452039488] wproc: CHECK job 69756 from worker Core Worker 941 timed out after 30.00s
[1452039488] wproc: Core Worker 941: job 69756 (pid=50579): Dormant child reaped
[1452042709] wproc: Core Worker 947: job 76554 (pid=21252) timed out. Killing it
[1452042709] wproc: CHECK job 76554 from worker Core Worker 947 timed out after 30.01s
[1452042709] wproc: Core Worker 947: job 76554 (pid=21252): Dormant child reaped
[1452072759] wproc: Core Worker 941: job 141052 (pid=59838) timed out. Killing it
[1452072759] wproc: CHECK job 141052 from worker Core Worker 941 timed out after 30.00s
[1452072759] wproc: Core Worker 941: job 141052 (pid=59838): Dormant child reaped
[1452073538] wproc: Registry request: name=Core Worker 20713;pid=20713
[1452073538] wproc: Registry request: name=Core Worker 20714;pid=20714
[1452073538] wproc: Registry request: name=Core Worker 20715;pid=20715
[1452073538] wproc: Registry request: name=Core Worker 20719;pid=20719
[1452073538] wproc: Registry request: name=Core Worker 20720;pid=20720
[1452073538] wproc: Registry request: name=Core Worker 20716;pid=20716
[1452073538] wproc: Registry request: name=Core Worker 20721;pid=20721
[1452073538] wproc: Registry request: name=Core Worker 20717;pid=20717
[1452073538] wproc: Registry request: name=Core Worker 20718;pid=20718
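The timed-out checks in an excerpt like this can be tallied per worker. This is a minimal sketch that assumes the log line format shown above; the function name `count_worker_timeouts` is made up for illustration and is not part of Nagios.

```shell
# count_worker_timeouts: hypothetical helper. Reads nagios.log lines on stdin
# and prints "<worker id> <number of timed-out CHECK jobs>" per Core Worker.
count_worker_timeouts() {
    awk '/wproc: CHECK job .* timed out after/ {
             # The worker id is the field right after the word "Worker".
             for (i = 1; i < NF; i++)
                 if ($i == "Worker") { count[$(i + 1)]++; break }
         }
         END { for (w in count) print w, count[w] }' | sort -n
}
```

It could be run as `count_worker_timeouts < /usr/local/nagios/var/nagios.log`. Only the "CHECK job ... timed out after" lines are counted, so the matching "Killing it" and "Dormant child reaped" lines do not inflate the totals.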
/usr/local/nagios/var/nagios.log is rotated daily, so I can't go back to last Sunday. The memory is rising again. I'm still not sure whether an Apply Configuration also resets the memory usage. I think it does, but I'll confirm in the next few days.
Scott,
I'm using this plugin I've been working on:
https://exchange.nagios.org/directory/P ... ss/details
This is the relevant code to determine memory percentage of the passed process:
Code: Select all
CheckMem=`ps -C $Process -o%mem= | paste -sd+ | bc`
RoundedMemResult=`echo $CheckMem | awk '{print int($1+0.5)}'`
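For readers following along: `ps -C` prints one %mem value per matching process, `paste -sd+` joins them into a sum for `bc`, and awk rounds the total to the nearest integer. The sketch below does the same arithmetic in a single awk call so it can be tried without `ps` or `bc`. It is an alternative for illustration, not the plugin's code, and the sample values are invented.

```shell
# sum_and_round: reads one %mem value per line on stdin and prints the
# total rounded to the nearest integer (hypothetical helper name).
sum_and_round() {
    awk '{ total += $1 } END { print int(total + 0.5) }'
}

# Made-up sample: three processes using 1.2%, 3.4% and 0.7% of memory.
printf '1.2\n3.4\n0.7\n' | sum_and_round   # prints 5
```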
Let me know if I can provide any other information.
Grtz
Willem
Re: Issues with graphs caused by nagios process memory leak
Posted: Thu Jan 07, 2016 2:09 pm
by tmcdonald
Just had a conversation with the devs, here's what we are thinking right now:
- We tend to believe the memory leak is in the gearman broker module
- A temporary band-aid is to restart Nagios every 24 hours, or as often as needed (for example, via an event handler that triggers at 80% memory use)
- Devs are setting up test systems to try and replicate
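The event handler idea from the list above could look roughly like the sketch below. It assumes a memory check on the nagios process with its critical threshold at 80%, and that Nagios passes `$SERVICESTATE$` and `$SERVICESTATETYPE$` as the two arguments. The function name and the `RESTART_CMD` variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical event handler for a "nagios memory" service check.
# RESTART_CMD defaults to a dry run that only prints the command.
RESTART_CMD=${RESTART_CMD:-echo service nagios restart}

handle_memory_event() {
    state=$1        # assumed to be $SERVICESTATE$
    state_type=$2   # assumed to be $SERVICESTATETYPE$
    # Act only on a confirmed (HARD) CRITICAL state, not on soft retries.
    if [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]; then
        $RESTART_CMD
    fi
}
```

With the default dry run, `handle_memory_event CRITICAL HARD` only prints the restart command. A real handler would probably need sudo rights for the nagios user, and since it restarts the process that launched it, it is worth testing on a non-production system first.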
Can you please verify the below information for your system?
Nagios XI v5.2.2
Nagios Core v4.1.1
mod_gearman v1.5
Number of hosts: 790
Number of services: 14780
Re: Issues with graphs caused by nagios process memory leak
Posted: Fri Jan 08, 2016 4:07 am
by WillemDH
Trevor,
Thanks for the update. I'll make a cronjob which restarts the Nagios service every 24 hours.
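A daily restart could be scheduled with an /etc/cron.d entry. The sketch below only prints such an entry; the helper name, the 03:00 default, and the use of `/sbin/service` are assumptions for illustration.

```shell
# print_restart_cron: hypothetical helper that prints an /etc/cron.d style
# line (minute hour day month weekday user command) restarting Nagios daily.
print_restart_cron() {
    hour=${1:-3}   # hour of day for the restart, defaults to 03:00
    printf '0 %s * * * root /sbin/service nagios restart\n' "$hour"
}
```

For example, `print_restart_cron 4 > /etc/cron.d/nagios-restart` would schedule the restart for 04:00.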
This is the current correct information about our setup:
Nagios XI v5.2.3
Nagios Core v4.1.1
Mod Gearman v1.5
Total Hosts: 806
Total Services: 15019
I also sent you a PM with the system profile. I noticed an error in the system profile about our certificate not matching localhost. I'll start a new thread about this, as I also saw it pop up while tailing cmdsubsys.log:
Code: Select all
ERROR: certificate common name "nagios.fqdn" doesn't match requested host name "localhost".
Let me know if you need anything else.
Re: Issues with graphs caused by nagios process memory leak
Posted: Fri Jan 08, 2016 3:36 pm
by tmcdonald
Thanks for the info! We're working on getting this sort of info from a few different users so we can find a common thread between them. For now unfortunately the answer is to do the restart of nagios as often as needed, but we're all thinking it has to do with mod_gearman at the moment. If this is the case it makes it a little more awkward to resolve since that is a third-party project, but we'll do what we can to work with their devs.
Re: Issues with graphs caused by nagios process memory leak
Posted: Fri Jan 08, 2016 4:57 pm
by SteveBeauchemin
Was seeing this same thing on my old server.
Had to apply config changes every day or so; if I waited too long, it took a reboot.
Gearman version was --
rpm -qa | grep gear
gearmand-1.1.12-2.el7.x86_64
libgearman-1.1.12-2.el7.x86_64
mod_gearman-1.5.0b1-1.el6.x86_64
On my new server, I am using different versions. Can go about 4 weeks before a meltdown now.
rpm -qa | grep gear
gearmand-devel-0.33-2.x86_64
gearmand-0.33-2.x86_64
gearmand-server-0.33-2.x86_64
mod_gearman2-2.1.2-1.el6.x86_64
Still leaking on the newer version, but I know I will commit changes and reset the memory leak in there at some point.
I have an older post where I asked about the gearman version - was given the gearman 2 stuff.
I will post my notes in a new topic, all the little name changes and tweaks it took to get gearman 2 working.
It's working quite well for me now.
Steve B
Re: Issues with graphs caused by nagios process memory leak
Posted: Sat Jan 09, 2016 6:48 am
by WillemDH
Thanks, I need to set up a new mod gearman worker node one of these days and the plan was to use mod gearman 2, so I'm looking forward to your notes, Steve. If it works better than 1.5, I might update my mrtg worker node too.
Re: Issues with graphs caused by nagios process memory leak
Posted: Mon Jan 11, 2016 1:19 pm
by bheden
Willem,
We've been able to consistently reproduce the memory leak. The solution is similar to the post found here:
https://support.nagios.com/forum/viewto ... 497#167237.
I've tested with the installation procedure listed below, and it is working well in my environment:
Code: Select all
yum remove gearmand libgearman mod_gearman libgearman-devel
cd /tmp
mkdir gearman_install
cd gearman_install/
wget http://mod-gearman.org/download/v2.1.1/rhel6/x86_64/gearmand-0.33-2.rhel6.x86_64.rpm
wget http://mod-gearman.org/download/v2.1.1/rhel6/x86_64/gearmand-devel-0.33-2.rhel6.x86_64.rpm
wget http://mod-gearman.org/download/v2.1.1/rhel6/x86_64/gearmand-server-0.33-2.rhel6.x86_64.rpm
wget http://mod-gearman.org/download/v2.1.1/rhel6/x86_64/mod_gearman2-2.1.1-1.rhel6.x86_64.rpm
yum --nogpgcheck localinstall *
sed -i 's/\(^broker_module=.*mod_gearman.*\)/#\1/' /usr/local/nagios/etc/nagios.cfg
echo "broker_module=/usr/lib64/mod_gearman2/mod_gearman2.o config=/etc/mod_gearman/mod_gearman_neb.conf eventhandler=no" >> /usr/local/nagios/etc/nagios.cfg
There may still be a memory leak, as SteveBeauchemin mentioned, but it is definitely not as hindering (or obvious).
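Before editing the live nagios.cfg, the sed expression from the procedure above can be tried on a scratch copy. A minimal sketch, with a made-up function name, that wraps the same expression:

```shell
# comment_out_gearman_broker: hypothetical wrapper around the sed expression
# above. Prefixes every broker_module line mentioning mod_gearman with '#'
# in the given file (GNU sed, edits in place).
comment_out_gearman_broker() {
    sed -i 's/\(^broker_module=.*mod_gearman.*\)/#\1/' "$1"
}
```

For example, `cp /usr/local/nagios/etc/nagios.cfg /tmp/nagios.cfg.test && comment_out_gearman_broker /tmp/nagios.cfg.test`, followed by `grep broker_module /tmp/nagios.cfg.test` to inspect the result. Two things to keep in mind: the pattern also matches a mod_gearman2 line, so the sed has to run before the new line is appended, and the `echo ... >>` step adds a duplicate line every time it is repeated.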
Re: Issues with graphs caused by nagios process memory leak
Posted: Fri Feb 12, 2016 7:09 am
by WillemDH
Bryan,
Luckily I tried this on my Nagios QA server first, today. It was a fresh install of gearman 2.1.5. I followed your recommendations and also tried Steve's guide. The installation of gearman2 seemed to work OK, but when I try to restart the nagios service, it fails. At the moment I'm unable to get the nagios service to start on my QA Nagios server.
Code: Select all
root@vnagiosqa:~ # service mod-gearman2-worker start [16-02-12 13:23:41]
Starting mod_gearman2_worker...OK
root@snagiosqa:~ # service gearmand start [16-02-12 13:23:46]
Starting gearmand: [ OK ]
root@nagiosqa:~ # service nagios start [16-02-12 13:23:51]
Starting nagios: done.
root@nagiosqa:~ # service nagios status [16-02-12 13:23:56]
nagios is not running
When I comment out the following line:
Code: Select all
broker_module=/usr/lib64/mod_gearman2/mod_gearman2.o config=/etc/mod_gearman2/module.conf
I can start the nagios service again.
Going to investigate further. Please let me know if you have an idea what could be the reason.
I can see this in the /var/log/messages
Code: Select all
Feb 12 13:23:28 nagiosqa nagios: Caught SIGTERM, shutting down...
Feb 12 13:23:28 nagiosqa nagios: Successfully shutdown... (PID=5262)
Feb 12 13:23:28 nagiosqa nagios: Event broker module 'NERD' deinitialized successfully.
Feb 12 13:23:28 nagiosqa nagios: ndomod: Shutdown complete.
Feb 12 13:23:28 nagiosqa nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
Feb 12 13:23:56 nagiosqa nagios: Nagios 4.1.1 starting... (PID=31583)
Feb 12 13:23:56 nagiosqa nagios: Local time is Fri Feb 12 13:23:56 CET 2016
Feb 12 13:23:56 nagiosqa nagios: LOG VERSION: 2.0
Feb 12 13:23:56 nagiosqa nagios: qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
Feb 12 13:23:56 nagiosqa nagios: qh: core query handler registered
Feb 12 13:23:56 nagiosqa nagios: nerd: Channel hostchecks registered successfully
Feb 12 13:23:56 nagiosqa nagios: nerd: Channel servicechecks registered successfully
Feb 12 13:23:56 nagiosqa nagios: nerd: Channel opathchecks registered successfully
Feb 12 13:23:56 nagiosqa nagios: nerd: Fully initialized and ready to rock!
Feb 12 13:23:56 nagiosqa nagios: wproc: Successfully registered manager as @wproc with query handler
Feb 12 13:23:56 nagiosqa nagios: wproc: Registry request: name=Core Worker 31588;pid=31588
Feb 12 13:23:56 nagiosqa nagios: wproc: Registry request: name=Core Worker 31587;pid=31587
Feb 12 13:23:56 nagiosqa nagios: wproc: Registry request: name=Core Worker 31586;pid=31586
Feb 12 13:23:56 nagiosqa nagios: wproc: Registry request: name=Core Worker 31585;pid=31585
Feb 12 13:23:56 nagiosqa nagios: Error: Could not load module '/usr/lib64/mod_gearman2/mod_gearman2.o' -> /usr/lib64/mod_gearman2/mod_gearman2.o: undefined symbol: nm_log
Feb 12 13:23:56 nagiosqa nagios: Error: Failed to load module '/usr/lib64/mod_gearman2/mod_gearman2.o'.
Feb 12 13:23:56 nagiosqa nagios: ndomod: NDOMOD 2.0.0 (02-28-2014) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Feb 12 13:23:56 nagiosqa nagios: ndomod: Successfully connected to data sink. 0 queued items to flush.
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for process data
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for log data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for system command data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for event handler data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for notification data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for comment data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for downtime data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for flapping data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for program status data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for host status data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for service status data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for adaptive program data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for adaptive host data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for adaptive service data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for external command data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for aggregated status data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for retention data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for contact data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for contact notification data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for acknowledgement data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for state change data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for contact status data'
Feb 12 13:23:56 nagiosqa nagios: ndomod registered for adaptive contact data'
Feb 12 13:23:56 nagiosqa nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
Feb 12 13:23:56 nagiosqa nagios: Error: Module loading failed. Aborting.
I tried setting debug = 2 in /etc/mod_gearman2/module.conf but nothing appeared in /var/log/mod_gearman2/mod_gearman_neb.log
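The key line in the log above is the "undefined symbol" error: the module refers to a symbol (`nm_log`) that the running Nagios binary does not provide, which may mean the module was built against a different core than the one loading it. A small sketch for pulling such errors out of a long syslog excerpt, with a made-up function name:

```shell
# missing_symbols: hypothetical helper. Reads syslog lines on stdin and prints
# "<module path> <symbol>" for every broker module that failed to load
# because of an undefined symbol.
missing_symbols() {
    sed -n "s/.*Could not load module '\([^']*\)'.*undefined symbol: \([A-Za-z0-9_]*\).*/\1 \2/p"
}
```

It could be run as `missing_symbols < /var/log/messages`. To see which symbols the module expects from the core, `nm -D --undefined-only /usr/lib64/mod_gearman2/mod_gearman2.o` lists them, assuming binutils is installed.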
EDIT: Tried again with 2.1.2-1, and this version seems to work OK on my Nagios QA server. 2.1.5-1 is not working, however.
Next problem: it seems I'm unable to install the gearmand package on my CentOS 7 worker node.
Code: Select all
yum deplist gearmand-server-0.33-2.rhel6.x86_64.rpm | awk '/provider:/ {print $2}' | sort -u
bash.x86_64
boost-program-options.i686
boost-program-options.x86_64
glibc.i686
glibc.x86_64
libevent.i686
libevent.x86_64
libgcc.x86_64
libstdc++.x86_64
[root@nagext gearman_install]# yum install boost-program-options.x86_64
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: centos.mirror.nucleus.be
* epel: epel.mirror.nucleus.be
* extras: centos.mirror.nucleus.be
* updates: centos.mirror.nucleus.be
Package boost-program-options-1.53.0-25.el7.x86_64 already installed and latest version
Nothing to do
[root@nagext gearman_install]# yum install ./gearmand-0.33-2.rhel6.x86_64.rpm
Loaded plugins: fastestmirror
Examining ./gearmand-0.33-2.rhel6.x86_64.rpm: 1:gearmand-0.33-2.x86_64
Marking ./gearmand-0.33-2.rhel6.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package gearmand.x86_64 1:0.33-2 will be installed
--> Processing Dependency: libboost_program_options-mt.so.5()(64bit) for package: 1:gearmand-0.33-2.x86_64
Loading mirror speeds from cached hostfile
* base: centos.mirror.nucleus.be
* epel: epel.mirror.nucleus.be
* extras: centos.mirror.nucleus.be
* updates: centos.mirror.nucleus.be
--> Finished Dependency Resolution
Error: Package: 1:gearmand-0.33-2.x86_64 (/gearmand-0.33-2.rhel6.x86_64)
Requires: libboost_program_options-mt.so.5()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
[root@nagext gearman_install]# yum --nogpgcheck localinstall gearmand-
gearmand-0.33-2.rhel6.x86_64.rpm gearmand-devel-0.33-2.rhel6.x86_64.rpm gearmand-server-0.33-2.rhel6.x86_64.rpm
[root@nagext gearman_install]# yum --nogpgcheck localinstall gearmand-0.33-2.rhel6.x86_64.rpm
Loaded plugins: fastestmirror
Examining gearmand-0.33-2.rhel6.x86_64.rpm: 1:gearmand-0.33-2.x86_64
Marking gearmand-0.33-2.rhel6.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package gearmand.x86_64 1:0.33-2 will be installed
--> Processing Dependency: libboost_program_options-mt.so.5()(64bit) for package: 1:gearmand-0.33-2.x86_64
Loading mirror speeds from cached hostfile
* base: centos.mirror.nucleus.be
* epel: epel.mirror.nucleus.be
* extras: centos.mirror.nucleus.be
* updates: centos.mirror.nucleus.be
--> Finished Dependency Resolution
Error: Package: 1:gearmand-0.33-2.x86_64 (/gearmand-0.33-2.rhel6.x86_64)
Requires: libboost_program_options-mt.so.5()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
[root@nagext gearman_install]# yum install boost-program-options.x86_64
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: centos.mirror.nucleus.be
* epel: epel.mirror.nucleus.be
* extras: centos.mirror.nucleus.be
* updates: centos.mirror.nucleus.be
Package boost-program-options-1.53.0-25.el7.x86_64 already installed and latest version
Nothing to do
What package do I need to be able to install the required gearman2 packages? Got these errors:
Code: Select all
Error: Package: 1:gearmand-server-0.33-2.x86_64 (/gearmand-server-0.33-2.rhel6.x86_64)
Requires: libboost_program_options-mt.so.5()(64bit)
Error: Package: 1:gearmand-0.33-2.x86_64 (/gearmand-0.33-2.rhel6.x86_64)
Requires: libboost_program_options-mt.so.5()(64bit)
Error: Package: 1:gearmand-server-0.33-2.x86_64 (/gearmand-server-0.33-2.rhel6.x86_64)
Requires: libevent-1.4.so.2()(64bit)
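The unmet dependencies can be reduced to a unique list straight from the yum output. A minimal sketch, assuming the "Requires:" line format shown above; the function name is made up for illustration.

```shell
# missing_requires: hypothetical helper. Reads yum error output on stdin and
# prints each unmet "Requires:" capability once.
missing_requires() {
    awk '$1 == "Requires:" { print $2 }' | sort -u
}
```

Each capability it prints can then be looked up with, for example, `yum provides 'libevent-1.4.so.2()(64bit)'`. Since the RPMs are rhel6 builds, the CentOS 7 repositories may simply not carry these library versions.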
Grtz
Re: Issues with graphs caused by nagios process memory leak
Posted: Fri Feb 12, 2016 10:59 am
by bheden
For CentOS 7, you'll need to install an additional package:
The gearmand-debuginfo-VERSION.rhel7.x86_64.rpm for your setup. We have it (v2.1.1) hosted at
https://assets.nagios.com/downloads/mod ... x86_64.rpm
Is the QA Server also running Cent7? If so, you need that package as well.
From the looks of your output, you might want to try installing boost-devel and then try the install again.
Re: Issues with graphs caused by nagios process memory leak
Posted: Fri Feb 12, 2016 11:55 am
by WillemDH
Bryan,
Code: Select all
Package boost-devel-1.41.0-27.el6.x86_64 already installed and latest version
on both the XI QA server and the worker node.
I installed
https://assets.nagios.com/downloads/mod ... x86_64.rpm on the worker. The XI QA server runs CentOS 6. I will continue working on this next week. Have a nice weekend!
Grtz
Willem