OOM-Killer kill nagios process

Helio
Posts: 9
Joined: Tue Sep 10, 2013 2:08 am

OOM-Killer kill nagios process

Post by Helio »

After 12 or 14 days of working fine, my Nagios server crashes. When the server stops working properly, /var/log/messages shows the kernel OOM killer killing many processes, including nagios. The first process killed for "out of memory" is nagios, so I suspect that either Nagios or the kernel has a bug.
It's very strange, because only about 25% of the RAM is in use at any time and CPU usage is very low (1%). The hardware is an HP ProLiant DL360 with 4 GB of RAM and a dual-core CPU. We can't believe it!
The kernel is 2.6.35-28-generic-pae, the distro is Ubuntu 10.10, and Nagios is version 3.2.3.
Nagios sends pings and snmpget requests to 44 different machines: ping to find out whether each one is up, and SNMP to get some details about its up or down state.
The crash looks like this:

Sep 4 16:07:29 segur-ProLiant-DL360-G6 snmptt[0]: .1.3.6.1.4.1.248.32.16.9.0.0.58 Normal "Status Events" 10.101.1.117 - Cl.-Connected Hirschmann BAT54-F Client+ 8.00.0167 / 29.06.2010 959117701130 UT_117 1270 WLAN-1 3
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552037] nagios invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552043] nagios cpuset=/ mems_allowed=0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552047] Pid: 1139, comm: nagios Not tainted 2.6.35-28-generic-pae #50-Ubuntu
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552050] Call Trace:
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552060] [<c01e50da>] dump_header+0x7a/0xb0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552065] [<c01e516c>] oom_kill_process+0x5c/0x160
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552069] [<c01e56d9>] ? select_bad_process+0xa9/0xe0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552073] [<c01e5761>] __out_of_memory+0x51/0xb0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552077] [<c01e5818>] out_of_memory+0x58/0xd0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552081] [<c01e86e6>] __alloc_pages_slowpath+0x496/0x4b0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552086] [<c01e886f>] __alloc_pages_nodemask+0x16f/0x1c0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552090] [<c01e88dc>] __get_free_pages+0x1c/0x30
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552096] [<c01505ad>] dup_task_struct+0x3d/0x130
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552101] [<c0226df3>] ? cp_new_stat64+0xe3/0x100
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552105] [<c0151338>] copy_process+0x88/0xc70
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552109] [<c0151fa3>] do_fork+0x83/0x3a0
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552114] [<c0223098>] ? vfs_write+0x128/0x190
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552120] [<c0110414>] sys_clone+0x34/0x40
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552123] [<c0109519>] ptregs_clone+0x15/0x3c
Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552128] [<c01093df>] ? sysenter_do_call+0x12/0x28
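
For anyone reading a trace like the one above, here is a minimal Python sketch that pulls the main fields out of the "invoked oom-killer" line. The function name `parse_oom_line` and the regular expression are my own assumptions, written only against the line format shown in this post. The bracketed kernel timestamp is seconds since boot, and `order=N` means the kernel was asking for 2**N contiguous pages.

```python
import re

# Assumed pattern, matched against the log line format quoted in this thread.
OOM_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] (?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp_mask>0x[0-9a-fA-F]+), order=(?P<order>\d+)"
)


def parse_oom_line(line):
    """Return the fields of an 'invoked oom-killer' line, or None if no match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    return {
        # Kernel timestamps count seconds since boot.
        "uptime_days": float(match.group("uptime")) / 86400,
        # Process whose allocation request triggered the OOM killer.
        "comm": match.group("comm"),
        "gfp_mask": int(match.group("gfp_mask"), 16),
        "order": order,
        # An order-N request asks for 2**N physically contiguous pages.
        "pages_requested": 2 ** order,
    }


sample = (
    "Sep 4 16:07:30 segur-ProLiant-DL360-G6 kernel: [1136138.552037] "
    "nagios invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0"
)
info = parse_oom_line(sample)
print(info)
```

On the sample line this reports an uptime of roughly 13 days, which matches the 12 to 14 days mentioned above, and shows that the failing request was made on behalf of the nagios process.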

Can anyone help us?

Thank you in advance.

Helio.-
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: OOM-Killer kill nagios process

Post by slansing »

I'm not sure what the reference to SNMP was. Is this only happening when your SNMP checks go out, or when traps come in?
Helio
Posts: 9
Joined: Tue Sep 10, 2013 2:08 am

Re: OOM-Killer kill nagios process

Post by Helio »

Thank you,
We have disabled the snmptrapd daemon so that it does not handle any trap packets. The server uses check_ping and check_snmp to query the machines directly. It can happen at any time, but on the two occasions we know of, the server had been running for 12 and 14 days. Surprisingly, the kernel starts killing processes slowly, until we lose control and the only way out is to power off. It's very difficult to find our problem when we see normal values in sar:

00:00:01  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit
00:05:01    2838952    1275052     30,99     194664    323732    788544    12,95
00:15:01    2837668    1276336     31,02     194664    323740    818256    13,44
00:25:01    2836060    1277944     31,06     194668    323760    818256    13,44
00:35:01    2833332    1280672     31,13     194680    323756    907392    14,91
00:45:01    2834744    1279260     31,10     194700    323760    818256    13,44
...
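
A rough Python sketch for watching these numbers over time, assuming the rows look like the sar memory report above (whitespace-separated columns, decimal commas from the locale). The names `parse_sar_memory` and `buffers_below` are hypothetical helpers, and the 26000 kB threshold is just the value mentioned in this post, not a recommended limit.

```python
def parse_sar_memory(text):
    """Parse sar-style memory rows into dicts keyed by the header names."""
    rows = []
    header = None
    for line in text.strip().splitlines():
        fields = line.split()
        if header is None:
            # First line: a timestamp followed by the column names.
            header = fields[1:]
            continue
        if len(fields) != len(header) + 1:
            continue  # skip elided ("...") or malformed rows
        # Accept decimal commas such as "30,99".
        values = [float(v.replace(",", ".")) for v in fields[1:]]
        rows.append({"time": fields[0], **dict(zip(header, values))})
    return rows


def buffers_below(rows, threshold_kb):
    """Return the sample times at which kbbuffers was under the threshold."""
    return [r["time"] for r in rows if r["kbbuffers"] < threshold_kb]


sample = """
00:00:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
00:05:01 2838952 1275052 30,99 194664 323732 788544 12,95
00:15:01 2837668 1276336 31,02 194664 323740 818256 13,44
00:25:01 2836060 1277944 31,06 194668 323760 818256 13,44
00:35:01 2833332 1280672 31,13 194680 323756 907392 14,91
00:45:01 2834744 1279260 31,10 194700 323760 818256 13,44
...
"""
rows = parse_sar_memory(sample)
print(len(rows), "samples parsed")
print("samples with kbbuffers under 26000 kB:", buffers_below(rows, 26000))
```

Run against a full day of sar data, this would list the times at which kbbuffers dropped, which can then be compared with the timestamps of the OOM killer messages.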

Any ideas? One clue may be that sar shows kbbuffers going down, to as low as 26000 kB.

Thank you in advance.
Helio,.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: OOM-Killer kill nagios process

Post by scottwilkerson »

Actually, this error usually comes up if the machine is running out of memory.

How much memory is on the machine?

What version of Nagios Core are you using?

Do you have enable_embedded_perl=1 in the nagios.cfg?
Former Nagios employee
Helio
Posts: 9
Joined: Tue Sep 10, 2013 2:08 am

Re: OOM-Killer kill nagios process

Post by Helio »

Hello Scott,
The machine has 4 GB of RAM. The Nagios Core version is 3.2.3. In nagios.cfg we have enable_embedded_perl=1.
It's very strange that it is running out of memory with 3 GB free. Any idea?
Thank you in advance.
Helio.-
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: OOM-Killer kill nagios process

Post by lmiltchev »

Try disabling embedded perl in nagios.cfg:

enable_embedded_perl=0
use_embedded_perl_implicitly=0
and see if this is going to fix your issue.
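
To confirm what is currently set before and after the change, something like the following Python sketch could be used. It only assumes the plain `key=value` layout of nagios.cfg with `#` comment lines; the function name `embedded_perl_settings` is made up for this example, and the sample text stands in for the real file, whose path depends on the installation.

```python
def embedded_perl_settings(cfg_text):
    """Return the embedded Perl directives found in nagios.cfg-style text."""
    wanted = ("enable_embedded_perl", "use_embedded_perl_implicitly")
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in wanted:
            found[key] = value.strip()
    return found


# Sample text standing in for the contents of nagios.cfg.
sample_cfg = """
# EMBEDDED PERL INTERPRETER OPTION
enable_embedded_perl=1
use_embedded_perl_implicitly=1
log_file=/var/log/nagios3/nagios.log
"""
settings = embedded_perl_settings(sample_cfg)
for name, value in settings.items():
    state = "disabled" if value == "0" else "still enabled"
    print(f"{name}={value} ({state})")
```

In practice the text would be read from the real nagios.cfg, and Nagios has to be restarted for changed values to take effect.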
Helio
Posts: 9
Joined: Tue Sep 10, 2013 2:08 am

Re: OOM-Killer kill nagios process

Post by Helio »

Thank you, lmiltchev!
I will try it as soon as possible (right now I am waiting to see the result of the last change I made, where I configured the OOM killer to reboot the server in an out-of-memory condition).
Let me ask a question about your suggestion. My Nagios only uses the check_ping and check_snmp plugins, both of which are compiled binaries for Linux. Could disabling embedded Perl affect my plugins in any way? Or is embedded Perl perhaps used somewhere else?
Thank you in advance.
Helio.-
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: OOM-Killer kill nagios process

Post by slansing »

You should be able to execute these plugins just fine without embedded perl enabled.