After 12 or 14 days of running fine, my Nagios server crashes. When the server starts misbehaving, /var/log/messages shows the kernel OOM killer killing many processes, including nagios. The first process killed for being "out of memory" is nagios, so I suspect the bug is either in Nagios or in the kernel.
It's very strange, because RAM usage stays at only about 25% and CPU usage is very low (1%). The hardware is an HP ProLiant DL360 with 4 GB of RAM and a dual-core CPU. We can't believe it!
The kernel is 2.6.35-28-generic-pae, the distro is Ubuntu 10.10, and Nagios is version 3.2.3.
Nagios sends pings and snmpget requests to 44 different machines: ping to know whether each one is up, and SNMP to get some details about its up or down state.
The crash looks like this:
Sep 4 16:07:29 segur-ProLiant-DL360-G6 snmptt[0]: .1.3.6.1.4.1.248.32.16.9.0.0.58 Normal "Status Events" 10.101.1.117 - Cl.-Connected Hirschmann BAT54-F Client+ 8.00.0167 / 29.06.2010 959117701130 UT_117 1270 WLAN-1 3
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552037] nagios invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552043] nagios cpuset=/ mems_allowed=0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552047] Pid: 1139, comm: nagios Not tainted 2.6.35-28-generic-pae #50-Ubuntu
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552050] Call Trace:
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552060] [<c01e50da>] dump_header+0x7a/0xb0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552065] [<c01e516c>] oom_kill_process+0x5c/0x160
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552069] [<c01e56d9>] ? select_bad_process+0xa9/0xe0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552073] [<c01e5761>] __out_of_memory+0x51/0xb0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552077] [<c01e5818>] out_of_memory+0x58/0xd0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552081] [<c01e86e6>] __alloc_pages_slowpath+0x496/0x4b0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552086] [<c01e886f>] __alloc_pages_nodemask+0x16f/0x1c0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552090] [<c01e88dc>] __get_free_pages+0x1c/0x30
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552096] [<c01505ad>] dup_task_struct+0x3d/0x130
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552101] [<c0226df3>] ? cp_new_stat64+0xe3/0x100
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552105] [<c0151338>] copy_process+0x88/0xc70
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552109] [<c0151fa3>] do_fork+0x83/0x3a0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552114] [<c0223098>] ? vfs_write+0x128/0x190
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552120] [<c0110414>] sys_clone+0x34/0x40
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552123] [<c0109519>] ptregs_clone+0x15/0x3c
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552128] [<c01093df>] ? sysenter_do_call+0x12/0x28
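For anyone reading the trace: the first kernel line records which process triggered the OOM killer and how large the failed request was, and the call trace (do_fork, copy_process, dup_task_struct) shows it happened while nagios was forking. Below is a minimal sketch that pulls those fields out of a log line. The regular expression is written against the line format shown here, and the 4 KiB page size is an assumption (the usual value on 32-bit x86), so treat the byte count as illustrative.

```python
import re

# Pattern for the "invoked oom-killer" header line as it appears in the log above.
OOM_HEADER = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp_mask>0x[0-9a-fA-F]+), order=(?P<order>\d+)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual size on 32-bit x86


def parse_oom_header(line):
    """Return the triggering command, gfp_mask and request size, or None."""
    match = OOM_HEADER.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    return {
        "comm": match.group("comm"),
        "gfp_mask": int(match.group("gfp_mask"), 16),
        "order": order,
        # An order-n request asks for 2**n physically contiguous pages.
        "bytes_requested": (2 ** order) * PAGE_SIZE,
    }


sample = (
    "Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552037] "
    "nagios invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0"
)
info = parse_oom_header(sample)
print(info)
```

Running the same function over every line of /var/log/messages would list each OOM event with the process that triggered it, which may help to see whether it is always nagios that hits the failed allocation.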
Can anyone help us?
Thank you in advance.
Helio.-
OOM-Killer kill nagios process
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: OOM-Killer kill nagios process
I'm not sure what the reference to SNMP was. Is this only happening when your SNMP checks go out, or when traps come in?
Re: OOM-Killer kill nagios process
Thank you,
We have disabled the snmptrapd daemon so that it does not handle any trap packets. The server uses check_ping and check_snmp to query the machines directly. It happens at any time, but on the two occasions we know of, the server had been running for 12 and 14 days. Surprisingly, the kernel starts killing processes slowly until we lose control, and the only way out is to power off. It's very difficult to find the problem when we see normal values in sar:
00:00:01  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit
00:05:01    2838952    1275052     30,99     194664    323732    788544    12,95
00:15:01    2837668    1276336     31,02     194664    323740    818256    13,44
00:25:01    2836060    1277944     31,06     194668    323760    818256    13,44
00:35:01    2833332    1280672     31,13     194680    323756    907392    14,91
00:45:01    2834744    1279260     31,10     194700    323760    818256    13,44
...
Any ideas? Could it be a clue that sar shows kbbuffers going down, to as low as 26000 kB?
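To follow up on the kbbuffers clue, the sar output can be scanned for the intervals where the buffers fall below some level. This is only a rough sketch: it assumes the column layout and the comma decimal separator shown in this post, and the threshold passed to buffer_drops is an arbitrary illustrative value, not a recommended limit.

```python
def parse_sar_memory(text):
    """Parse `sar -r` style rows into dicts keyed by the header names."""
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "kbbuffers" in fields:
            # Header row: first column is a timestamp, the rest are names.
            header = ["time"] + fields[1:]
            continue
        if header is None or len(fields) != len(header):
            continue  # skip "..." and anything that is not a data row
        row = {"time": fields[0]}
        for name, value in zip(header[1:], fields[1:]):
            # The locale in this thread prints decimals with a comma.
            row[name] = float(value.replace(",", "."))
        rows.append(row)
    return rows


def buffer_drops(rows, threshold_kb):
    """Return the timestamps where kbbuffers is below threshold_kb."""
    return [r["time"] for r in rows if r["kbbuffers"] < threshold_kb]


sample = """\
00:00:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
00:05:01 2838952 1275052 30,99 194664 323732 788544 12,95
00:15:01 2837668 1276336 31,02 194664 323740 818256 13,44
00:25:01 2836060 1277944 31,06 194668 323760 818256 13,44
00:35:01 2833332 1280672 31,13 194680 323756 907392 14,91
00:45:01 2834744 1279260 31,10 194700 323760 818256 13,44
...
"""
rows = parse_sar_memory(sample)
# 100000 kB is a made-up threshold for illustration only.
print(buffer_drops(rows, 100000))
```

Fed with the sar data from the days before a crash, the list of timestamps would show when the decline in kbbuffers starts relative to the first OOM message.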
Thank you in advance.
Helio.-
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
Re: OOM-Killer kill nagios process
Actually, this error usually comes up if the machine is running out of memory.
How much memory is on the machine?
What version of Nagios Core are you using?
Do you have enable_embedded_perl=1 in the nagios.cfg?
Re: OOM-Killer kill nagios process
Hello Scott,
The machine has 4GB of RAM. The Nagios Core version is 3.2.3. In nagios.cfg we have enable_embedded_perl=1.
It's very strange that it is running out of memory with 3 GB free. Any idea?
Thank you in advance.
Helio.-
Re: OOM-Killer kill nagios process
Try disabling embedded Perl in nagios.cfg:
Code:
enable_embedded_perl=0
use_embedded_perl_implicitly=0
and see if this fixes your issue.
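After editing the file, it can be worth confirming that both directives really ended up set to 0. The following is a minimal sketch, not an official Nagios tool: read_nagios_settings is a hypothetical helper that only understands plain key=value lines, and the sample text stands in for the real nagios.cfg, whose location depends on the installation.

```python
def read_nagios_settings(cfg_text, keys):
    """Return the last value found in cfg_text for each requested key."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() in keys:
            found[name.strip()] = value.strip()
    return found


# Stand-in for the contents of nagios.cfg (illustrative only).
sample_cfg = """\
# EMBEDDED PERL INTERPRETER OPTION
enable_embedded_perl=0
use_embedded_perl_implicitly=0
"""

wanted = {"enable_embedded_perl", "use_embedded_perl_implicitly"}
settings = read_nagios_settings(sample_cfg, wanted)
print(settings)
```

To run it against the real file, read the file contents into a string and pass that instead of sample_cfg.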
Re: OOM-Killer kill nagios process
Thank you, lmiltchev!
I will try it as soon as possible (right now I am waiting on the last change I made, where I configured the OOM killer to reboot the machine on an out-of-memory condition).
Let me ask a question about your suggestion. My Nagios only uses the check_ping and check_snmp plugins, and both are compiled binaries for Linux. Could disabling embedded Perl affect my plugins in any way? Or is it perhaps used somewhere else?
Thank you in advance.
Helio.-
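The post above does not say how the reboot-on-OOM change was made. One common way on Linux is the pair of sysctls vm.panic_on_oom and kernel.panic, and the sketch below assumes that approach; it only reads the current values and changes nothing. The function names are made up for this example.

```python
import os


def read_oom_reboot_settings(proc_sys="/proc/sys"):
    """Read the two sysctls commonly used to reboot on out-of-memory."""
    settings = {}
    for name in ("vm/panic_on_oom", "kernel/panic"):
        key = name.replace("/", ".")
        path = os.path.join(proc_sys, name)
        try:
            with open(path) as handle:
                settings[key] = int(handle.read().strip())
        except (OSError, ValueError):
            settings[key] = None  # missing or unreadable on this system
    return settings


def reboots_on_oom(settings):
    """True if an OOM triggers a panic and a panic triggers a reboot."""
    panic_on_oom = settings.get("vm.panic_on_oom")
    panic_timeout = settings.get("kernel.panic")
    # kernel.panic = 0 means "do not reboot after a panic".
    return bool(panic_on_oom) and bool(panic_timeout)


current = read_oom_reboot_settings()
print(current, reboots_on_oom(current))
```

If the change was made some other way, this check does not apply.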
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: OOM-Killer kill nagios process
You should be able to execute these plugins just fine without embedded perl enabled.