Nagios XI keeps crashing post upgrade to XI 2014

chriscamm
Posts: 72
Joined: Thu Aug 22, 2013 6:12 am

Re: Nagios XI keeps crashing post upgrade to XI 2014

Post by chriscamm »

30 minutes is the longest I can keep the server up before it crashes. Do you have any thoughts or suggestions?

Now Core stops working too, and I have to run:

service ndo2db stop

Then nagios.log displays:

[05-06-2014 16:42:00] Warning: A system time change of 979 seconds (0d 0h 16m 19s forwards in time) has been detected. Compensating...
[05-06-2014 16:42:00] ndomod: Please check remote ndo2db log, database connection or SSL Parameters
[05-06-2014 16:42:00] ndomod: Error writing to data sink! Some output may get lost...
Thanks

Chris
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Post by abrist »

My suspicion is that you are hitting message limits:
Jun 4 22:20:01 qualngs ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Jun 4 22:20:01 qualngs ndo2db: Warning: queue send error, retrying...
Jun 4 22:20:08 qualngs ndo2db: Message sent to queue.
Jun 4 22:20:08 qualngs ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 256000 of 256000 messages and 262144000 of 262144000 bytes in the queue. See README for kernel tuning options.
Jun 4 22:20:28 qualngs ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may neeed to be tuned. See README.
Is there any correlation between when your server crashes and when these errors appear in your logs?
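To make that correlation easier to check, here is a minimal Python sketch that pulls the usage figures out of the "Retrying message send" warning quoted above. The regular expression is an assumption based only on the wording of that one log line, and `queue_usage` is a hypothetical helper name, not part of ndo2db.

```python
import re

# Assumed pattern: matches the usage sentence exactly as it is worded
# in the ndo2db warning quoted in this post.
USAGE_RE = re.compile(
    r"using (\d+) of (\d+) messages and (\d+) of (\d+) bytes in the queue"
)


def queue_usage(line):
    """Return (message_fraction, byte_fraction), or None if the line
    does not carry usage figures or reports a zero limit."""
    match = USAGE_RE.search(line)
    if match is None:
        return None
    msgs, max_msgs, used_bytes, max_bytes = map(int, match.groups())
    if max_msgs == 0 or max_bytes == 0:
        return None
    return msgs / max_msgs, used_bytes / max_bytes


if __name__ == "__main__":
    sample = (
        "Jun 4 22:20:08 qualngs ndo2db: Warning: Retrying message send. "
        "You are currently using 256000 of 256000 messages and "
        "262144000 of 262144000 bytes in the queue."
    )
    print(queue_usage(sample))
```

Running this over /var/log/messages and noting the timestamps where either fraction reaches 1.0 should show whether the queue fills up shortly before each crash.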
chriscamm
Posts: 72
Joined: Thu Aug 22, 2013 6:12 am

Post by chriscamm »

It could be.

Here are my Kernel limits:

sysctl -p

net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
error: "net.bridge.bridge-nf-call-ip6tables" is an unknown key
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
error: "net.bridge.bridge-nf-call-arptables" is an unknown key
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.shmmax = 4294967295
kernel.shmall = 268435456
kernel.msgmni = 256000
ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 78917
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 78917
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
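For anyone comparing these limits across servers, here is a rough Python sketch that turns `sysctl -p` output like the above into a dictionary and picks out the System V message queue settings (`kernel.msgmnb` is the maximum bytes per queue, `kernel.msgmax` the maximum size of one message, and `kernel.msgmni` the maximum number of queues). The function names are hypothetical, and the parser assumes the `key = value` layout shown in this post.

```python
def parse_sysctl(text):
    """Parse `sysctl -p` style output into a {key: value} dict.

    Lines starting with "error:" (such as the unknown bridge keys above)
    and lines without "=" are skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("error:") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def message_queue_limits(settings):
    """Return only the kernel.msg* entries, converted to integers."""
    return {
        key: int(value)
        for key, value in settings.items()
        if key.startswith("kernel.msg")
    }


if __name__ == "__main__":
    sample = """\
kernel.sysrq = 0
error: "net.bridge.bridge-nf-call-iptables" is an unknown key
kernel.msgmnb = 262144000
kernel.msgmax = 262144000
kernel.msgmni = 256000
"""
    print(message_queue_limits(parse_sysctl(sample)))
```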
chriscamm
Posts: 72
Joined: Thu Aug 22, 2013 6:12 am

Post by chriscamm »

I have been monitoring, and I can see that when it crashes we do get a flood of errors:

Jun  5 18:04:49 qualngs ndo2db: Message sent to queue.
Jun  5 18:04:49 qualngs ndo2db: Warning: queue send error, retrying...
Jun  5 18:04:51 qualngs ndo2db: Message sent to queue.
Jun  5 18:04:51 qualngs ndo2db: Warning: queue send error, retrying...
Jun  5 18:04:54 qualngs ndo2db: Message sent to queue.
Jun  5 18:04:54 qualngs ndo2db: Warning: queue send error, retrying...
Jun  5 18:04:56 qualngs ndo2db: Message sent to queue.
Jun  5 18:04:56 qualngs ndo2db: Warning: queue send error, retrying...
Jun  5 18:04:59 qualngs ndo2db: Message sent to queue.
Jun  5 18:04:59 qualngs ndo2db: Warning: queue send error, retrying...
I will look at the limits.

Chris
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Post by lmiltchev »

Let us know if you are still having issues after tuning the kernel options.
chriscamm
Posts: 72
Joined: Thu Aug 22, 2013 6:12 am

Post by chriscamm »

I tuned the kernel, and now I am getting this:

        0 sh
Jun  5 19:35:43 qualngs kernel: [23866]   500 23866      296       11   0       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23867]   500 23867      308       11   4       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23868]   500 23868      296       10   2       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23869]   500 23869      296       10   3       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23870]   500 23870      296       11   3       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23871]   500 23871      296       11   2       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23872]   500 23872        3        2   0       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23873]   500 23873      295        7   2       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23874]   500 23874      296       10   1       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23875]   500 23875      293        3   4       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23876]   500 23876      295        7   1       0             0 sh
Jun  5 19:35:43 qualngs kernel: [23877]   500 23877    40282     3328   2       0             0 check_wmi_plus.
Jun  5 19:35:43 qualngs kernel: [23878]   500 23878    41035     4160   4       0             0 check_wmi_plus.
Jun  5 19:35:43 qualngs kernel: [23879]   500 23879    40281     3370   3       0             0 check_wmi_plus.
Jun  5 19:35:43 qualngs kernel: [23880]   500 23880    40282     3420   4       0             0 check_wmi_plus.
Jun  5 19:35:43 qualngs kernel: [23881]   500 23881    40282     3322   5       0             0 check_wmi_plus.
Jun  5 19:35:43 qualngs kernel: Out of memory: Kill process 2964 (nagios) score 2 or sacrifice child
Jun  5 19:35:43 qualngs kernel: Killed process 2964, UID 500, (nagios) total-vm:69452kB, anon-rss:14292kB, file-rss:44kB
Jun  5 19:36:17 qualngs ndo2db: Error: mysql_query() failed for 'UPDATE nagios_conninfo SET disconnect_time=NOW(), last_checkin_time=NOW(), data_end_time=FROM_UNIXTIME(0), bytes_processed='0', lines_processed='0', entries_processed='0' WHERE conninfo_id='0''
Jun  5 19:36:17 qualngs ndo2db: mysql_error: 'MySQL server has gone away'
Jun  5 19:36:17 qualngs ndo2db: Error: Connection to MySQL database has been lost!
Jun  5 19:36:32 qualngs vmsvc[1532]: [ warning] [vmsvc] Error in the RPC receive loop: RpcIn: Unable to send.
chriscamm
Posts: 72
Joined: Thu Aug 22, 2013 6:12 am

Post by chriscamm »

More from /var/log/messages:

Jun  5 19:35:43 qualngs kernel: [23881]   500 23881    40282     3349   5       0             0 check_wmi_plus.
Jun  5 19:35:43 qualngs kernel: Out of memory: Kill process 2964 (nagios) score 2 or sacrifice child
Jun  5 19:35:43 qualngs kernel: Killed process 2997, UID 500, (nagios) total-vm:65612kB, anon-rss:1564kB, file-rss:16kB
Jun  5 19:35:43 qualngs kernel: check_wmi_plus. invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jun  5 19:35:43 qualngs kernel: check_wmi_plus. cpuset=/ mems_allowed=0
Jun  5 19:35:43 qualngs kernel: Pid: 22341, comm: check_wmi_plus. Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Jun  5 19:35:43 qualngs kernel: Call Trace:
Jun  5 19:35:43 qualngs kernel: [<ffffffff810d05a1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Jun  5 19:35:43 qualngs kernel: [<ffffffff81122950>] ? dump_header+0x90/0x1b0
Jun  5 19:35:43 qualngs kernel: [<ffffffff81227a5c>] ? security_real_capable_noaudit+0x3c/0x70
Jun  5 19:35:43 qualngs kernel: [<ffffffff81122dd2>] ? oom_kill_process+0x82/0x2a0
Jun  5 19:35:43 qualngs kernel: [<ffffffff81122d11>] ? select_bad_process+0xe1/0x120
Jun  5 19:35:43 qualngs kernel: [<ffffffff81123210>] ? out_of_memory+0x220/0x3c0
Jun  5 19:35:43 qualngs kernel: [<ffffffff8112fb2f>] ? __alloc_pages_nodemask+0x89f/0x8d0
Jun  5 19:35:43 qualngs kernel: [<ffffffff81167baa>] ? alloc_pages_vma+0x9a/0x150
Jun  5 19:35:43 qualngs kernel: [<ffffffff8114acad>] ? handle_pte_fault+0x73d/0xb00
Jun  5 19:35:43 qualngs kernel: [<ffffffff81121480>] ? generic_file_aio_read+0x380/0x700
Jun  5 19:35:43 qualngs kernel: [<ffffffff8114b29a>] ? handle_mm_fault+0x22a/0x300
Jun  5 19:35:43 qualngs kernel: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
Jun  5 19:35:43 qualngs kernel: [<ffffffff8114fd1b>] ? __vm_enough_memory+0x3b/0x190
Jun  5 19:35:43 qualngs kernel: [<ffffffff8152da7e>] ? do_page_fault+0x3e/0xa0
Jun  5 19:35:43 qualngs kernel: [<ffffffff8152ae35>] ? page_fault+0x25/0x30
Jun  5 19:35:43 qualngs kernel: Mem-Info:
Jun  5 19:35:43 qualngs kernel: Node 0 DMA per-cpu:
Jun  5 19:35:43 qualngs kernel: CPU    0: hi:    0, btch:   1 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    1: hi:    0, btch:   1 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    2: hi:    0, btch:   1 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    3: hi:    0, btch:   1 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    4: hi:    0, btch:   1 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    5: hi:    0, btch:   1 usd:   0
Jun  5 19:35:43 qualngs kernel: Node 0 DMA32 per-cpu:
Jun  5 19:35:43 qualngs kernel: CPU    0: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    1: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    2: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    3: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    4: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    5: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: Node 0 Normal per-cpu:
Jun  5 19:35:43 qualngs kernel: CPU    0: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    1: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    2: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    3: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: CPU    4: hi:  186, btch:  31 usd:  23
Jun  5 19:35:43 qualngs kernel: CPU    5: hi:  186, btch:  31 usd:   0
Jun  5 19:35:43 qualngs kernel: active_anon:1834041 inactive_anon:316860 isolated_anon:14656
Jun  5 19:35:43 qualngs kernel: active_file:155 inactive_file:248 isolated_file:196
Jun  5 19:35:43 qualngs kernel: unevictable:0 dirty:0 writeback:1242 unstable:0
Jun  5 19:35:43 qualngs kernel: free:28389 slab_reclaimable:6558 slab_unreclaimable:253041
Jun  5 19:35:43 qualngs kernel: mapped:363 shmem:762 pagetables:43001 bounce:0
Jun  5 19:35:43 qualngs kernel: Node 0 DMA free:15348kB min:96kB low:120kB high:144kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14952kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun  5 19:35:43 qualngs kernel: lowmem_reserve[]: 0 3000 10070 10070
Jun  5 19:35:43 qualngs kernel: Node 0 DMA32 free:49856kB min:20104kB low:25128kB high:30156kB active_anon:2099864kB inactive_anon:519788kB active_file:196kB inactive_file:660kB unevictable:0kB isolated(anon):26496kB isolated(file):784kB present:3072096kB mlocked:0kB dirty:0kB writeback:1744kB mapped:920kB shmem:0kB slab_reclaimable:544kB slab_unreclaimable:12176kB kernel_stack:1864kB pagetables:44628kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:288 all_unreclaimable? no
Jun  5 19:35:43 qualngs kernel: lowmem_reserve[]: 0 0 7070 7070
Jun  5 19:35:43 qualngs kernel: Node 0 Normal free:48352kB min:47380kB low:59224kB high:71068kB active_anon:5236300kB inactive_anon:747652kB active_file:424kB inactive_file:332kB unevictable:0kB isolated(anon):32128kB isolated(file):0kB present:7239680kB mlocked:0kB dirty:0kB writeback:3224kB mapped:532kB shmem:3048kB slab_reclaimable:25688kB slab_unreclaimable:999988kB kernel_stack:7312kB pagetables:127376kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun  5 19:35:43 qualngs kernel: lowmem_reserve[]: 0 0 0 0
Jun  5 19:35:43 qualngs kernel: Node 0 DMA: 1*4kB 0*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 3*4096kB = 15348kB
Jun  5 19:35:43 qualngs kernel: Node 0 DMA32: 440*4kB 647*8kB 254*16kB 66*32kB 10*64kB 3*128kB 2*256kB 1*512kB 2*1024kB 2*2048kB 7*4096kB = 49976kB
Jun  5 19:35:43 qualngs kernel: Node 0 Normal: 5819*4kB 2382*8kB 115*16kB 2*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 48332kB
Jun  5 19:35:43 qualngs kernel: 8420 total pagecache pages
Jun  5 19:35:43 qualngs kernel: 6766 pages in swap cache
Jun  5 19:35:43 qualngs kernel: Swap cache stats: add 541753, delete 534987, find 211067/227336
Jun  5 19:35:43 qualngs kernel: Free swap  = 0kB
Jun  5 19:35:43 qualngs kernel: Total swap = 1310704kB
Jun  5 19:35:43 qualngs kernel: 2621424 pages RAM
Jun  5 19:35:43 qualngs kernel: 90986 pages reserved
Jun  5 19:35:43 qualngs kernel: 50004 pages shared
Jun  5 19:35:43 qualngs kernel: 2460467 pages non-shared
Jun  5 19:35:43 qualngs kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Jun  5 19:35:43 qualngs kernel: [  588]     0   588     2736        2   3     -17         -1000 udevd
Jun  5 19:35:43 qualngs kernel: [ 1532]     0  1532    44767       66   4       0             0 vmtoolsd
Jun  5 19:35:43 qualngs kernel: [ 1601]     0  1601    62272       52   0       0             0 rsyslogd
Jun  5 19:35:43 qualngs kernel: [ 1630]     0  1630     2738       38   3       0             0 irqbalance
Jun  5 19:35:43 qualngs kernel: [ 1727]    81  1727     5351        3   3       0             0 dbus-daemon
Jun  5 19:35:43 qualngs kernel: [ 1738]    70  1738     6918        3   2       0             0 avahi-daemon
Jun  5 19:35:43 qualngs kernel: [ 1739]    70  1739     6918        2   2       0             0 avahi-daemon
Jun  5 19:35:43 qualngs kernel: [ 1767]     0  1767     1020        2   5       0             0 acpid
Jun  5 19:35:43 qualngs kernel: [ 1776]    68  1776     9489       40   3       0             0 hald
Jun  5 19:35:43 qualngs kernel: [ 1777]     0  1777     5082        3   0       0             0 hald-runner
Jun  5 19:35:43 qualngs kernel: [ 1807]     0  1807     5612        3   4       0             0 hald-addon-inpu
Jun  5 19:35:43 qualngs kernel: [ 1823]    68  1823     4484        3   0       0             0 hald-addon-acpi
Jun  5 19:35:43 qualngs kernel: [ 1840]     0  1840    16652        2   1     -17         -1000 sshd
Jun  5 19:35:43 qualngs kernel: [ 1848]     0  1848     5545        3   0       0             0 xinetd
Jun  5 19:35:43 qualngs kernel: [ 1856]    38  1856     7680       27   4       0             0 ntpd
Jun  5 19:35:43 qualngs kernel: [ 1891]     0  1891    27051       35   4       0             0 mysqld_safe
Jun  5 19:35:43 qualngs kernel: [ 2036]    26  2036    54091       44   4     -17         -1000 postmaster
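To see which processes hold the memory in a dump like this, here is a small Python sketch that totals the `rss` column of the OOM-killer process table by process name. It assumes the column order shown in the header line above (pid, uid, tgid, total_vm, rss, cpu, oom_adj, oom_score_adj, name) and a 4 kB page size; `rss_by_name` is a hypothetical helper, and names are compared as the kernel prints them (truncated, e.g. `check_wmi_plus.`).

```python
import re
from collections import defaultdict

PAGE_KB = 4  # assumption: 4 kB pages, the usual size on x86_64

# One row of the OOM-killer process table, in the column order above.
ROW_RE = re.compile(
    r"kernel: \[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(-?\d+)\s+(-?\d+)\s+(\S+)\s*$"
)


def rss_by_name(lines):
    """Sum resident memory (in kB) per process name.

    Lines that are not process-table rows (call traces, Mem-Info,
    "Out of memory" messages) are ignored.
    """
    totals = defaultdict(int)
    for line in lines:
        match = ROW_RE.search(line)
        if match:
            name = match.group(9)
            rss_pages = int(match.group(5))
            totals[name] += rss_pages * PAGE_KB
    return dict(totals)


if __name__ == "__main__":
    with open("/var/log/messages") as handle:
        totals = rss_by_name(handle)
    # Print the ten largest consumers first.
    for name, kb in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{kb:>10} kB  {name}")
```

If several OOM dumps are in the same file, the totals add up across all of them, so it is worth trimming the input to a single dump first.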
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Post by scottwilkerson »

This looks like the server is running out of memory, at which point the kernel starts killing processes such as nagios and mysqld.

How much memory is installed on this server? How many hosts/services do you have on the system?
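The kernel dump posted above already reports RAM and swap totals, so here is a minimal Python sketch for pulling them out of /var/log/messages. It assumes the "pages RAM", "Free swap" and "Total swap" lines are worded as in that dump and that pages are 4 kB; `memory_summary` is a hypothetical helper name.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, the usual size on x86_64


def memory_summary(log_text):
    """Extract RAM and swap figures from one OOM-killer Mem-Info dump.

    Returns None if any of the three expected lines is missing.
    """
    ram = re.search(r"(\d+) pages RAM", log_text)
    free_swap = re.search(r"Free swap\s*=\s*(\d+)kB", log_text)
    total_swap = re.search(r"Total swap\s*=\s*(\d+)kB", log_text)
    if not (ram and free_swap and total_swap):
        return None
    return {
        # pages -> kB -> GiB
        "ram_gib": int(ram.group(1)) * PAGE_KB / (1024 * 1024),
        "swap_total_kb": int(total_swap.group(1)),
        "swap_free_kb": int(free_swap.group(1)),
    }


if __name__ == "__main__":
    with open("/var/log/messages") as handle:
        print(memory_summary(handle.read()))
```

A `swap_free_kb` of 0 alongside a nonzero `swap_total_kb` would mean swap was configured but fully used when the OOM killer ran.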
chriscamm
Posts: 72
Joined: Thu Aug 22, 2013 6:12 am

Post by chriscamm »

Hi,

The server has 6 vCPUs and 12 GB of RAM, and monitors about 200 hosts and 4,500 services.

I have added another 4 GB of RAM, so let me see if that helps.

Thanks for the advice. I will let you know how it goes.

Chris
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Post by abrist »

Additionally, do you have a swap partition?