This support forum board is for support questions relating to
Nagios XI, our flagship commercial network monitoring solution.
chriscamm
Posts: 72 Joined: Thu Aug 22, 2013 6:12 am
by chriscamm » Thu Jun 05, 2014 10:44 am
30 mins is the most I can keep the server up for without it crashing. Do you have any thoughts or suggestions?
Now Core stops working too, and I have to run
service ndo2db stop
Then nagios.log displays:
[05-06-2014 16:42:00] Warning: A system time change of 979 seconds (0d 0h 16m 19s forwards in time) has been detected. Compensating...
[05-06-2014 16:42:00] ndomod: Please check remote ndo2db log, database connection or SSL Parameters
[05-06-2014 16:42:00] ndomod: Error writing to data sink! Some output may get lost...
Thanks
Chris
abrist
Red Shirt
Posts: 8334 Joined: Thu Nov 15, 2012 1:20 pm
by abrist » Thu Jun 05, 2014 11:00 am
My suspicion is that you are hitting message limits:
Jun 4 22:20:01 qualngs ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Jun 4 22:20:01 qualngs ndo2db: Warning: queue send error, retrying...
Jun 4 22:20:08 qualngs ndo2db: Message sent to queue.
Jun 4 22:20:08 qualngs ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 256000 of 256000 messages and 262144000 of 262144000 bytes in the queue. See README for kernel tuning options.
Jun 4 22:20:28 qualngs ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Is there any correlation between when your server crashes and when these errors appear in your logs?
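To make the "hitting message limits" suspicion easier to check across a long log, the usage figures in the ndo2db warning can be pulled out and turned into fractions. This is only a minimal sketch: the regular expression is an assumption based on the wording of the warning quoted above, and `queue_usage` is a hypothetical helper name, not part of NDOUtils.

```python
import re

# Matches the usage figures in the ndo2db "Retrying message send" warning.
# The pattern is inferred from the log line quoted in this thread.
QUEUE_USAGE = re.compile(
    r"using (\d+) of (\d+) messages and (\d+) of (\d+) bytes"
)


def queue_usage(line):
    """Return (message_fraction, byte_fraction), or None if the line has no usage figures."""
    match = QUEUE_USAGE.search(line)
    if match is None:
        return None
    msgs, max_msgs, used_bytes, max_bytes = map(int, match.groups())
    return msgs / max_msgs, used_bytes / max_bytes


sample = (
    "Jun 4 22:20:08 qualngs ndo2db: Warning: Retrying message send. "
    "You are currently using 256000 of 256000 messages and "
    "262144000 of 262144000 bytes in the queue."
)
print(queue_usage(sample))  # (1.0, 1.0)
```

A result of (1.0, 1.0) means both the message count and the byte count reported by ndo2db are at their limits, which matches the warning above.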
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
by chriscamm » Thu Jun 05, 2014 11:16 am
It could be.
Here are my kernel limits:
sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 256000
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 78917
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 78917
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
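For reference, the three System V message queue settings in the sysctl output above are kernel.msgmnb (maximum bytes in a single queue), kernel.msgmax (maximum size of a single message) and kernel.msgmni (maximum number of queue identifiers). The values shown, 262144000 and 256000, are the same numbers ndo2db reported as its byte and message limits. A minimal sketch for pulling just those settings out of `sysctl -p` output, skipping the "error:" lines, could look like this; `parse_sysctl` is a hypothetical helper and the sample is a subset of the output posted above.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines; skip anything else, such as 'error:' lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("error:") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Subset of the sysctl -p output posted in this thread.
sample = """\
kernel.sysrq = 0
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.msgmni = 256000
"""

msg_limits = {
    key: int(value)
    for key, value in parse_sysctl(sample).items()
    if key.startswith("kernel.msg")
}
print(msg_limits)
# {'kernel.msgmnb': 262144000, 'kernel.msgmax': 262144000, 'kernel.msgmni': 256000}
```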
by chriscamm » Thu Jun 05, 2014 12:14 pm
I have been monitoring, and I can see that when it crashes we do get a flood of errors:
Jun 5 18:04:49 qualngs ndo2db: Message sent to queue.
Jun 5 18:04:49 qualngs ndo2db: Warning: queue send error, retrying...
Jun 5 18:04:51 qualngs ndo2db: Message sent to queue.
Jun 5 18:04:51 qualngs ndo2db: Warning: queue send error, retrying...
Jun 5 18:04:54 qualngs ndo2db: Message sent to queue.
Jun 5 18:04:54 qualngs ndo2db: Warning: queue send error, retrying...
Jun 5 18:04:56 qualngs ndo2db: Message sent to queue.
Jun 5 18:04:56 qualngs ndo2db: Warning: queue send error, retrying...
Jun 5 18:04:59 qualngs ndo2db: Message sent to queue.
Jun 5 18:04:59 qualngs ndo2db: Warning: queue send error, retrying...
I will look at the limits.
Chris
lmiltchev
Bugs find me
Posts: 13589 Joined: Mon May 23, 2011 12:15 pm
by lmiltchev » Thu Jun 05, 2014 1:48 pm
Let us know if you are still having issues after tuning the kernel options.
Be sure to check out our Knowledgebase for helpful articles and solutions!
by chriscamm » Thu Jun 05, 2014 2:15 pm
Kernel tuned, and now I am getting this:
0 sh
Jun 5 19:35:43 qualngs kernel: [23866] 500 23866 296 11 0 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23867] 500 23867 308 11 4 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23868] 500 23868 296 10 2 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23869] 500 23869 296 10 3 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23870] 500 23870 296 11 3 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23871] 500 23871 296 11 2 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23872] 500 23872 3 2 0 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23873] 500 23873 295 7 2 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23874] 500 23874 296 10 1 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23875] 500 23875 293 3 4 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23876] 500 23876 295 7 1 0 0 sh
Jun 5 19:35:43 qualngs kernel: [23877] 500 23877 40282 3328 2 0 0 check_wmi_plus.
Jun 5 19:35:43 qualngs kernel: [23878] 500 23878 41035 4160 4 0 0 check_wmi_plus.
Jun 5 19:35:43 qualngs kernel: [23879] 500 23879 40281 3370 3 0 0 check_wmi_plus.
Jun 5 19:35:43 qualngs kernel: [23880] 500 23880 40282 3420 4 0 0 check_wmi_plus.
Jun 5 19:35:43 qualngs kernel: [23881] 500 23881 40282 3322 5 0 0 check_wmi_plus.
Jun 5 19:35:43 qualngs kernel: Out of memory: Kill process 2964 (nagios) score 2 or sacrifice child
Jun 5 19:35:43 qualngs kernel: Killed process 2964, UID 500, (nagios) total-vm:69452kB, anon-rss:14292kB, file-rss:44kB
Jun 5 19:36:17 qualngs ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='0', lines_processed='0', entries_processed='0' WHERE conninfo_id='0''
Jun 5 19:36:17 qualngs ndo2db: mysql_error: 'MySQL server has gone away'
Jun 5 19:36:17 qualngs ndo2db: Error: Connection to MySQL database has been lost!
Jun 5 19:36:32 qualngs vmsvc[1532]: [ warning] [vmsvc] Error in the RPC receive loop: RpcIn: Unable to send.
by chriscamm » Thu Jun 05, 2014 2:18 pm
More from /var/log/messages:
Jun 5 19:35:43 qualngs kernel: [23881] 500 23881 40282 3349 5 0 0 check_wmi_plus.
Jun 5 19:35:43 qualngs kernel: Out of memory: Kill process 2964 (nagios) score 2 or sacrifice child
Jun 5 19:35:43 qualngs kernel: Killed process 2997, UID 500, (nagios) total-vm:65612kB, anon-rss:1564kB, file-rss:16kB
Jun 5 19:35:43 qualngs kernel: check_wmi_plus. invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jun 5 19:35:43 qualngs kernel: check_wmi_plus. cpuset=/ mems_allowed=0
Jun 5 19:35:43 qualngs kernel: Pid: 22341, comm: check_wmi_plus. Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Jun 5 19:35:43 qualngs kernel: Call Trace:
Jun 5 19:35:43 qualngs kernel: [<ffffffff810d05a1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Jun 5 19:35:43 qualngs kernel: [<ffffffff81122950>] ? dump_header+0x90/0x1b0
Jun 5 19:35:43 qualngs kernel: [<ffffffff81227a5c>] ? security_real_capable_noaudit+0x3c/0x70
Jun 5 19:35:43 qualngs kernel: [<ffffffff81122dd2>] ? oom_kill_process+0x82/0x2a0
Jun 5 19:35:43 qualngs kernel: [<ffffffff81122d11>] ? select_bad_process+0xe1/0x120
Jun 5 19:35:43 qualngs kernel: [<ffffffff81123210>] ? out_of_memory+0x220/0x3c0
Jun 5 19:35:43 qualngs kernel: [<ffffffff8112fb2f>] ? __alloc_pages_nodemask+0x89f/0x8d0
Jun 5 19:35:43 qualngs kernel: [<ffffffff81167baa>] ? alloc_pages_vma+0x9a/0x150
Jun 5 19:35:43 qualngs kernel: [<ffffffff8114acad>] ? handle_pte_fault+0x73d/0xb00
Jun 5 19:35:43 qualngs kernel: [<ffffffff81121480>] ? generic_file_aio_read+0x380/0x700
Jun 5 19:35:43 qualngs kernel: [<ffffffff8114b29a>] ? handle_mm_fault+0x22a/0x300
Jun 5 19:35:43 qualngs kernel: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
Jun 5 19:35:43 qualngs kernel: [<ffffffff8114fd1b>] ? __vm_enough_memory+0x3b/0x190
Jun 5 19:35:43 qualngs kernel: [<ffffffff8152da7e>] ? do_page_fault+0x3e/0xa0
Jun 5 19:35:43 qualngs kernel: [<ffffffff8152ae35>] ? page_fault+0x25/0x30
Jun 5 19:35:43 qualngs kernel: Mem-Info:
Jun 5 19:35:43 qualngs kernel: Node 0 DMA per-cpu:
Jun 5 19:35:43 qualngs kernel: CPU 0: hi: 0, btch: 1 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 1: hi: 0, btch: 1 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 2: hi: 0, btch: 1 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 3: hi: 0, btch: 1 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 4: hi: 0, btch: 1 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 5: hi: 0, btch: 1 usd: 0
Jun 5 19:35:43 qualngs kernel: Node 0 DMA32 per-cpu:
Jun 5 19:35:43 qualngs kernel: CPU 0: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 1: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 2: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 3: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 4: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 5: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: Node 0 Normal per-cpu:
Jun 5 19:35:43 qualngs kernel: CPU 0: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 1: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 2: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 3: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: CPU 4: hi: 186, btch: 31 usd: 23
Jun 5 19:35:43 qualngs kernel: CPU 5: hi: 186, btch: 31 usd: 0
Jun 5 19:35:43 qualngs kernel: active_anon:1834041 inactive_anon:316860 isolated_anon:14656
Jun 5 19:35:43 qualngs kernel: active_file:155 inactive_file:248 isolated_file:196
Jun 5 19:35:43 qualngs kernel: unevictable:0 dirty:0 writeback:1242 unstable:0
Jun 5 19:35:43 qualngs kernel: free:28389 slab_reclaimable:6558 slab_unreclaimable:253041
Jun 5 19:35:43 qualngs kernel: mapped:363 shmem:762 pagetables:43001 bounce:0
Jun 5 19:35:43 qualngs kernel: Node 0 DMA free:15348kB min:96kB low:120kB high:144kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14952kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 5 19:35:43 qualngs kernel: lowmem_reserve[]: 0 3000 10070 10070
Jun 5 19:35:43 qualngs kernel: Node 0 DMA32 free:49856kB min:20104kB low:25128kB high:30156kB active_anon:2099864kB inactive_anon:519788kB active_file:196kB inactive_file:660kB unevictable:0kB isolated(anon):26496kB isolated(file):784kB present:3072096kB mlocked:0kB dirty:0kB writeback:1744kB mapped:920kB shmem:0kB slab_reclaimable:544kB slab_unreclaimable:12176kB kernel_stack:1864kB pagetables:44628kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:288 all_unreclaimable? no
Jun 5 19:35:43 qualngs kernel: lowmem_reserve[]: 0 0 7070 7070
Jun 5 19:35:43 qualngs kernel: Node 0 Normal free:48352kB min:47380kB low:59224kB high:71068kB active_anon:5236300kB inactive_anon:747652kB active_file:424kB inactive_file:332kB unevictable:0kB isolated(anon):32128kB isolated(file):0kB present:7239680kB mlocked:0kB dirty:0kB writeback:3224kB mapped:532kB shmem:3048kB slab_reclaimable:25688kB slab_unreclaimable:999988kB kernel_stack:7312kB pagetables:127376kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 5 19:35:43 qualngs kernel: lowmem_reserve[]: 0 0 0 0
Jun 5 19:35:43 qualngs kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 3*4096kB = 15348kB
Jun 5 19:35:43 qualngs kernel: Node 0 DMA32: 440*4kB 647*8kB 254*16kB 66*32kB 10*64kB 3*128kB 2*256kB 1*512kB 2*1024kB 2*2048kB 7*4096kB = 49976kB
Jun 5 19:35:43 qualngs kernel: Node 0 Normal: 5819*4kB 2382*8kB 115*16kB 2*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 48332kB
Jun 5 19:35:43 qualngs kernel: 8420 total pagecache pages
Jun 5 19:35:43 qualngs kernel: 6766 pages in swap cache
Jun 5 19:35:43 qualngs kernel: Swap cache stats: add 541753, delete 534987, find 211067/227336
Jun 5 19:35:43 qualngs kernel: Free swap = 0kB
Jun 5 19:35:43 qualngs kernel: Total swap = 1310704kB
Jun 5 19:35:43 qualngs kernel: 2621424 pages RAM
Jun 5 19:35:43 qualngs kernel: 90986 pages reserved
Jun 5 19:35:43 qualngs kernel: 50004 pages shared
Jun 5 19:35:43 qualngs kernel: 2460467 pages non-shared
Jun 5 19:35:43 qualngs kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Jun 5 19:35:43 qualngs kernel: [ 588] 0 588 2736 2 3 -17 -1000 udevd
Jun 5 19:35:43 qualngs kernel: [ 1532] 0 1532 44767 66 4 0 0 vmtoolsd
Jun 5 19:35:43 qualngs kernel: [ 1601] 0 1601 62272 52 0 0 0 rsyslogd
Jun 5 19:35:43 qualngs kernel: [ 1630] 0 1630 2738 38 3 0 0 irqbalance
Jun 5 19:35:43 qualngs kernel: [ 1727] 81 1727 5351 3 3 0 0 dbus-daemon
Jun 5 19:35:43 qualngs kernel: [ 1738] 70 1738 6918 3 2 0 0 avahi-daemon
Jun 5 19:35:43 qualngs kernel: [ 1739] 70 1739 6918 2 2 0 0 avahi-daemon
Jun 5 19:35:43 qualngs kernel: [ 1767] 0 1767 1020 2 5 0 0 acpid
Jun 5 19:35:43 qualngs kernel: [ 1776] 68 1776 9489 40 3 0 0 hald
Jun 5 19:35:43 qualngs kernel: [ 1777] 0 1777 5082 3 0 0 0 hald-runner
Jun 5 19:35:43 qualngs kernel: [ 1807] 0 1807 5612 3 4 0 0 hald-addon-inpu
Jun 5 19:35:43 qualngs kernel: [ 1823] 68 1823 4484 3 0 0 0 hald-addon-acpi
Jun 5 19:35:43 qualngs kernel: [ 1840] 0 1840 16652 2 1 -17 -1000 sshd
Jun 5 19:35:43 qualngs kernel: [ 1848] 0 1848 5545 3 0 0 0 xinetd
Jun 5 19:35:43 qualngs kernel: [ 1856] 38 1856 7680 27 4 0 0 ntpd
Jun 5 19:35:43 qualngs kernel: [ 1891] 0 1891 27051 35 4 0 0 mysqld_safe
Jun 5 19:35:43 qualngs kernel: [ 2036] 26 2036 54091 44 4 -17 -1000 postmaster
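The per-process table in an OOM dump like the one above can be summarised to see which commands hold the most resident memory when the kill happens. The following is a rough sketch only. It assumes the column order given by the dump's own header line (pid, uid, tgid, total_vm, rss, cpu, oom_adj, oom_score_adj, name), that rss is counted in pages, and that pages are 4 kB (consistent with the "1*4kB" entries in the buddy lists). `rss_by_name` is a hypothetical helper, and the sample lines are copied from the logs in this thread.

```python
import re
from collections import defaultdict

# One row of the OOM task table: [pid] uid tgid total_vm rss cpu oom_adj oom_score_adj name
TASK = re.compile(
    r"kernel: \[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+\d+\s+-?\d+\s+-?\d+\s+(\S+)"
)
PAGE_KB = 4  # assumption: 4 kB pages


def rss_by_name(lines):
    """Sum resident memory in kB per process name; non-table lines are ignored."""
    totals = defaultdict(int)
    for line in lines:
        match = TASK.search(line)
        if match:
            rss_pages = int(match.group(5))
            totals[match.group(6)] += rss_pages * PAGE_KB
    return dict(totals)


sample = [
    "Jun 5 19:35:43 qualngs kernel: [23876] 500 23876 295 7 1 0 0 sh",
    "Jun 5 19:35:43 qualngs kernel: [23877] 500 23877 40282 3328 2 0 0 check_wmi_plus.",
    "Jun 5 19:35:43 qualngs kernel: [23878] 500 23878 41035 4160 4 0 0 check_wmi_plus.",
    "Jun 5 19:35:43 qualngs kernel: [ 588] 0 588 2736 2 3 -17 -1000 udevd",
    "Jun 5 19:35:43 qualngs kernel: Call Trace:",
]

totals = rss_by_name(sample)
for name, kb in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {kb:>8d} kB")
```

Run against the full /var/log/messages excerpt, a summary like this would show whether the many concurrent check_wmi_plus processes account for a large share of resident memory. The excerpt posted here is cut off, so it cannot settle that on its own.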
scottwilkerson
DevOps Engineer
Posts: 19396 Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
by scottwilkerson » Thu Jun 05, 2014 3:19 pm
This basically looks like you are running out of memory, and then the system (the kernel's OOM killer) starts killing services: nagios, mysqld, etc.
How much memory is installed on this server? How many hosts/services do you have on the system?
by chriscamm » Thu Jun 05, 2014 4:30 pm
Hi,
The server has 6 vCPUs and 12GB RAM, and is monitoring about 200 hosts and 4500 services.
I have added another 4GB RAM, so let me see if that helps.
Thanks for the advice. I will let you know how it goes.
Chris
by abrist » Thu Jun 05, 2014 4:34 pm
Additionally, do you have a swap partition?
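The kernel dump posted earlier in the thread reports "Total swap = 1310704kB" and "Free swap = 0kB", which suggests swap exists but was fully used at the time of the kill. On a running system the same figures are available from /proc/meminfo. Below is a minimal sketch of reading them. `swap_status` is a hypothetical helper, and the sample text is constructed from the figures in the kernel dump rather than captured from the server.

```python
def swap_status(meminfo_text):
    """Return (swap_total_kb, swap_free_kb) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])
    return values.get("SwapTotal", 0), values.get("SwapFree", 0)


# Constructed sample using the swap figures from the OOM dump in this thread.
sample = "SwapTotal:       1310704 kB\nSwapFree:              0 kB\n"

total_kb, free_kb = swap_status(sample)
print(total_kb, free_kb)  # 1310704 0
if total_kb == 0:
    print("no swap configured")
elif free_kb == 0:
    print("swap configured but exhausted")
```

On a live Linux host the text would come from reading /proc/meminfo, for example `swap_status(open("/proc/meminfo").read())`.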