OOM killer kills the Nagios process
Posted: Tue Oct 01, 2013 6:43 am
After running fine for 12 to 14 days, my Nagios server crashes. When the server starts misbehaving, /var/log/messages shows the kernel OOM killer killing many processes, including nagios. The first process killed for "out of memory" is nagios, so I suspect a bug either in Nagios or in the kernel.
This is very strange because RAM usage stays at only about 25% the whole time, and CPU usage is very low (1%). The hardware is an HP ProLiant DL360 with 4 GB of RAM and a dual-core CPU. We can't believe it!
The kernel is 2.6.35-28-generic-pae, the distro is Ubuntu 10.10, and Nagios is version 3.2.3.
Nagios sends pings and snmpget requests to 44 different machines: ping to check whether each host is up, and SNMP to get some details about its up or down state.
The crash looks like this:
Sep 4 16:07:29 segur-ProLiant-DL360-G6 snmptt[0]: .1.3.6.1.4.1.248.32.16.9.0.0.58 Normal "Status Events" 10.101.1.117 - Cl.-Connected Hirschmann BAT54-F Client+ 8.00.0167 / 29.06.2010 959117701130 UT_117 1270 WLAN-1 3
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552037] nagios invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552043] nagios cpuset=/ mems_allowed=0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552047] Pid: 1139, comm: nagios Not tainted 2.6.35-28-generic-pae #50-Ubuntu
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552050] Call Trace:
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552060] [<c01e50da>] dump_header+0x7a/0xb0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552065] [<c01e516c>] oom_kill_process+0x5c/0x160
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552069] [<c01e56d9>] ? select_bad_process+0xa9/0xe0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552073] [<c01e5761>] __out_of_memory+0x51/0xb0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552077] [<c01e5818>] out_of_memory+0x58/0xd0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552081] [<c01e86e6>] __alloc_pages_slowpath+0x496/0x4b0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552086] [<c01e886f>] __alloc_pages_nodemask+0x16f/0x1c0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552090] [<c01e88dc>] __get_free_pages+0x1c/0x30
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552096] [<c01505ad>] dup_task_struct+0x3d/0x130
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552101] [<c0226df3>] ? cp_new_stat64+0xe3/0x100
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552105] [<c0151338>] copy_process+0x88/0xc70
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552109] [<c0151fa3>] do_fork+0x83/0x3a0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552114] [<c0223098>] ? vfs_write+0x128/0x190
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552120] [<c0110414>] sys_clone+0x34/0x40
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552123] [<c0109519>] ptregs_clone+0x15/0x3c
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552128] [<c01093df>] ? sysenter_do_call+0x12/0x28
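For anyone reading the trace: the first kernel line reports order=1, meaning the failing request was for 2^1 contiguous pages, and the call trace shows it came from a fork (do_fork -> dup_task_struct -> __get_free_pages). Below is a minimal Python sketch that pulls those fields out of the header line. The helper name parse_oom_header and the 4 KiB page size are assumptions made for illustration, not something taken from the server itself.

```python
import re

# Header line copied from the kernel log above.
LINE = ("Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552037] "
        "nagios invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0")

# Assumption: 4 KiB pages, the usual size on 32-bit x86.
PAGE_SIZE = 4096

PATTERN = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp>0x[0-9a-fA-F]+), order=(?P<order>\d+)"
)


def parse_oom_header(line):
    """Return the fields of an 'invoked oom-killer' line, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    pages = 2 ** order  # an order-n request asks for 2^n contiguous pages
    return {
        "comm": match.group("comm"),
        "gfp_mask": int(match.group("gfp"), 16),
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
    }


info = parse_oom_header(LINE)
print(info)
```

Under that page-size assumption, the request that triggered the OOM killer here was small (two contiguous pages), which fits with overall RAM usage looking low when the crash happens.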
Can anyone help us?
Thank you in advance.
Helio.-