Re: missing bandwidth perf data for network devices
Posted: Mon Jan 04, 2016 2:51 pm
by rkennedy
How's the regular disk usage looking? (not just inodes)
This plugin should work for monitoring the file -
https://exchange.nagios.org/directory/A ... nt/details
From there, you could use event handlers to trigger a bash script that checks all the relevant thresholds and writes the results to a file.
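A minimal sketch of what such an event handler script could look like. It assumes the handler is defined so that Nagios passes `$SERVICESTATE$` and `$SERVICESTATETYPE$` as the first two arguments; the function names and the default output path `/tmp/nagios_resource_snapshot.log` are hypothetical, so adjust them to your setup.

```shell
#!/bin/bash
# Hypothetical event handler sketch: snapshot resource usage on a hard non-OK state.
# Assumed invocation: snapshot.sh "$SERVICESTATE$" "$SERVICESTATETYPE$" [output_file]

# Return success only for a HARD WARNING or HARD CRITICAL state.
should_snapshot() {
    local state="$1" state_type="$2"
    [ "$state_type" = "HARD" ] || return 1
    [ "$state" = "WARNING" ] || [ "$state" = "CRITICAL" ]
}

# Append a timestamped snapshot of disk, inode and limit information to a file.
write_snapshot() {
    local out="$1"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        echo "--- disk usage ---"
        df -h 2>/dev/null || true
        echo "--- inode usage ---"
        df -i 2>/dev/null || true
        echo "--- soft limits of the calling user ---"
        ulimit -a
    } >> "$out"
}

# Only act when called with at least a state and a state type.
if [ "$#" -ge 2 ]; then
    if should_snapshot "$1" "$2"; then
        write_snapshot "${3:-/tmp/nagios_resource_snapshot.log}"
    fi
fi
```

Since the handler runs as the nagios user, the `ulimit -a` section reflects the limits that user's shell sees, which may differ from the limits of the running daemon.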
Just to clarify - is this happening just for one service, or to your whole system?
Re: missing bandwidth perf data for network devices
Posted: Mon Jan 04, 2016 2:55 pm
by brdr
Thanks.
I first noticed it was happening to network devices on bandwidth. The latest issue (30-Dec) is system wide (all service checks and host checks stopped graphing).
Re: missing bandwidth perf data for network devices
Posted: Tue Jan 05, 2016 10:15 am
by ssax
Are you running gearman on this server?
Also, I found this thread here:
https://support.nagios.com/forum/viewto ... 57#p150057
jdalrymple wrote:
CFT6Server wrote:
Warning: fork() in my_system_r() failed for command
Sounds like potentially hitting a ulimit or a memory exhaustion issue.
Probably would be worthwhile to get a roundabout idea of your nagios process count and your memory usage:
Code:
[root@localhost ~]# lsof | grep "^nagios" | wc -l
124
[root@limits ~]# cat /proc/`cat /usr/local/nagios/var/nagios.lock`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             30385                30385                processes
Max open files            8192                 8192                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       30385                30385                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
[root@localhost ~]# free
             total       used       free     shared    buffers     cached
Mem:       3908740    3195252     713488      28364     153964    2230408
-/+ buffers/cache:     810880    3097860
Swap:      2031612          0    2031612
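To turn the numbers above into a headroom figure, something along these lines could work. It is only a sketch: the helper names `soft_limit` and `usage_percent` are made up for illustration, and it counts the open descriptors of the main Nagios process via `/proc/<pid>/fd`, which is not the same number the `lsof | grep` line reports.

```shell
#!/bin/bash
# Sketch: compare the open file count of the Nagios daemon with its soft limit.

# Print the soft limit for a named row of a /proc/<pid>/limits style file.
# $1 = path to the limits file, $2 = row name, e.g. "Max open files"
soft_limit() {
    awk -v name="$2" '
        index($0, name) == 1 {
            rest = substr($0, length(name) + 1)
            split(rest, f, " ")
            print f[1]
            exit
        }' "$1"
}

# Print used/limit as an integer percentage, or "n/a" when no finite limit is known.
usage_percent() {
    if [ -z "$2" ] || [ "$2" = "unlimited" ]; then
        echo "n/a"
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

# Lock file path taken from the command earlier in this thread.
LOCK=/usr/local/nagios/var/nagios.lock
if [ -r "$LOCK" ]; then
    pid=$(cat "$LOCK")
    limit=$(soft_limit "/proc/$pid/limits" "Max open files")
    used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    echo "open files: $used of $limit ($(usage_percent "$used" "$limit")%)"
fi
```

Reading `/proc/<pid>/fd` normally requires running as root or as the nagios user.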
Re: missing bandwidth perf data for network devices
Posted: Wed Jan 06, 2016 12:18 pm
by brdr
Thanks ssax. Yes, we do run gearman on the XI server, using 2 worker servers. I definitely agree that we are hitting a ulimit, perhaps 'max user processes' or 'open files'. I have attached the (soft) ulimits for the nagios user below. Right now, the process count for the nagios user is around 325, and the open file count is around 2200.
Once I detect the pattern showing up in nagios.log, I can check the limits.
Since the reboot on 31-Dec we have not seen this issue.
[root@bed-600-124 archives]# grep -l my_system_r *.log
nagios-01-01-2016-00.log
nagios-06-18-2015-00.log
nagios-06-19-2015-00.log
nagios-06-20-2015-00.log
nagios-12-09-2015-00.log
nagios-12-10-2015-00.log
nagios-12-31-2015-00.log
[root@bed-600-124 archives]# su - nagios
[nagios@bed-600-124 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 95123
max locked memory       (kbytes, -l) 128
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096 (HARD LIMIT is also 4096)
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 20480
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024 (HARD LIMIT is 4096)
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
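For catching the pattern the next time it shows up, a rough sketch like the following could be run from cron. The log path `/usr/local/nagios/var/nagios.log`, the snapshot path and the function name `count_fork_failures` are assumptions for illustration, not something confirmed in this thread.

```shell
#!/bin/bash
# Sketch: when the fork failure message appears in nagios.log, record how many
# tasks the nagios user currently has, for comparison with 'max user processes'.

# Print the number of lines in a log file that contain the fork failure message.
count_fork_failures() {
    # grep -c exits non-zero when the count is 0, so ignore its exit status
    grep -cF 'fork() in my_system_r() failed' "$1" || true
}

LOG=/usr/local/nagios/var/nagios.log    # assumed log location
OUT=/tmp/nagios_fork_snapshot.log       # hypothetical output file

if [ -r "$LOG" ]; then
    hits=$(count_fork_failures "$LOG")
    if [ "$hits" -gt 0 ]; then
        # -L lists threads as well, since they count toward the process limit
        tasks=$(ps -L -u nagios -o lwp= | wc -l)
        echo "$(date '+%Y-%m-%d %H:%M:%S') fork_failures=$hits nagios_tasks=$tasks" >> "$OUT"
    fi
fi
```

The count covers the whole current log file, so repeated runs will keep reporting earlier failures until the log rotates.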
Re: missing bandwidth perf data for network devices
Posted: Wed Jan 06, 2016 6:10 pm
by tmcdonald
Yea, fork problems for sure reek of resource limits being hit. How long do you estimate it will take for this to appear again?