public inbox for linux-kernel@vger.kernel.org
* Out of memory, but no OOM Killer? (2.6.9-ac11)
@ 2004-11-26 22:47 Ralf Hildebrandt
  2004-11-26 23:56 ` Nick Piggin
  2004-11-28 17:29 ` Alan Cox
  0 siblings, 2 replies; 6+ messages in thread
From: Ralf Hildebrandt @ 2004-11-26 22:47 UTC (permalink / raw)
  To: linux-kernel

rsync seems to want lots of memory, yet the OOM killer doesn't strike.
Subsequently, that machine died an ugly death until delivered by a
power-cycle.

Why doesn't the OOM killer reap rsync?
...
Nov 26 05:58:19 backup-in -- MARK --
Nov 26 06:02:04 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:04 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:04 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:04 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:08 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:08 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:08 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:08 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:08 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:08 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:08 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:08 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:08 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:08 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:08 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:08 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:08 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:08 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:08 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20
Nov 26 06:02:08 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:17 backup-in kernel: printk: 32 messages suppressed.
Nov 26 06:02:17 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:02:17 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:02:17 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:02:17 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:03:33 backup-in kernel: printk: 11 messages suppressed.
Nov 26 06:03:33 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:03:33 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:03:33 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:03:33 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:03:33 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:03:33 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:03:33 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:03:33 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:03:33 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:03:33 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:03:33 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:03:33 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:03:33 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:03:33 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:03:33 backup-in kernel: pdflush: page allocation failure. order:3, mode:0x20
Nov 26 06:03:33 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:04:08 backup-in kernel: swapper: page allocation failure. order:3, mode:0x20
Nov 26 06:04:08 backup-in kernel:  [__alloc_pages+576/908] __alloc_pages+0x240/0x38c
Nov 26 06:04:08 backup-in kernel:  [__get_free_pages+24/49] __get_free_pages+0x18/0x31
Nov 26 06:04:08 backup-in kernel:  [kmem_getpages+30/201] kmem_getpages+0x1e/0xc9
Nov 26 06:04:08 backup-in kernel:  [cache_grow+188/390] cache_grow+0xbc/0x186
Nov 26 06:04:08 backup-in kernel:  [cache_alloc_refill+468/532] cache_alloc_refill+0x1d4/0x214
Nov 26 06:04:08 backup-in kernel:  [__kmalloc+122/124] __kmalloc+0x7a/0x7c
Nov 26 06:04:08 backup-in kernel:  [alloc_skb+83/252] alloc_skb+0x53/0xfc
Nov 26 06:04:08 backup-in kernel:  [pg0+545011501/1068684288] e1000_alloc_rx_buffers+0x47/0xe8 [e1000]
Nov 26 06:04:08 backup-in kernel:  [pg0+545010700/1068684288] e1000_clean_rx_irq+0x184/0x45e [e1000]
Nov 26 06:04:08 backup-in kernel:  [pg0+545010212/1068684288] e1000_clean_tx_irq+0x1a0/0x204 [e1000]
Nov 26 06:04:08 backup-in kernel:  [pg0+545009728/1068684288] e1000_intr+0x3e/0x82 [e1000]
Nov 26 06:04:08 backup-in kernel:  [handle_IRQ_event+60/110] handle_IRQ_event+0x3c/0x6e
Nov 26 06:04:08 backup-in kernel:  [do_IRQ+222/422] do_IRQ+0xde/0x1a6
Nov 26 06:04:08 backup-in kernel:  =======================
Nov 26 06:04:08 backup-in kernel:  [common_interrupt+24/32] common_interrupt+0x18/0x20
Nov 26 06:04:08 backup-in kernel:  [default_idle+0/44] default_idle+0x0/0x2c
Nov 26 06:04:08 backup-in kernel:  [default_idle+41/44] default_idle+0x29/0x2c
Nov 26 06:04:08 backup-in kernel:  [cpu_idle+46/60] cpu_idle+0x2e/0x3c
Nov 26 06:04:08 backup-in kernel:  [start_kernel+356/383] start_kernel+0x164/0x17f
Nov 26 06:04:08 backup-in kernel:  [unknown_bootoption+0/374] unknown_bootoption+0x0/0x176
Nov 26 06:04:08 backup-in kernel: swapper: page allocation failure. order:3, mode:0x20
Nov 26 06:04:08 backup-in kernel:  [__alloc_pages+576/908] __alloc_pages+0x240/0x38c
Nov 26 06:04:08 backup-in kernel:  [__get_free_pages+24/49] __get_free_pages+0x18/0x31
Nov 26 06:04:08 backup-in kernel:  [kmem_getpages+30/201] kmem_getpages+0x1e/0xc9
Nov 26 06:04:08 backup-in kernel:  [cache_grow+188/390] cache_grow+0xbc/0x186
Nov 26 06:04:08 backup-in kernel:  [cache_alloc_refill+468/532] cache_alloc_refill+0x1d4/0x214
Nov 26 06:04:08 backup-in kernel:  [__kmalloc+122/124] __kmalloc+0x7a/0x7c
Nov 26 06:04:08 backup-in kernel:  [alloc_skb+83/252] alloc_skb+0x53/0xfc
Nov 26 06:04:08 backup-in kernel:  [pg0+545011501/1068684288] e1000_alloc_rx_buffers+0x47/0xe8 [e1000]
Nov 26 06:04:08 backup-in kernel:  [pg0+545010700/1068684288] e1000_clean_rx_irq+0x184/0x45e [e1000]
Nov 26 06:04:08 backup-in kernel:  [pg0+545010212/1068684288] e1000_clean_tx_irq+0x1a0/0x204 [e1000]
Nov 26 06:04:08 backup-in kernel:  [pg0+545009728/1068684288] e1000_intr+0x3e/0x82 [e1000]
Nov 26 06:04:08 backup-in kernel:  [handle_IRQ_event+60/110] handle_IRQ_event+0x3c/0x6e
Nov 26 06:04:08 backup-in kernel:  [do_IRQ+222/422] do_IRQ+0xde/0x1a6
Nov 26 06:04:08 backup-in kernel:  =======================
Nov 26 06:04:08 backup-in kernel:  [common_interrupt+24/32] common_interrupt+0x18/0x20
Nov 26 06:04:08 backup-in kernel:  [default_idle+0/44] default_idle+0x0/0x2c
Nov 26 06:04:08 backup-in kernel:  [default_idle+41/44] default_idle+0x29/0x2c
Nov 26 06:04:08 backup-in kernel:  [cpu_idle+46/60] cpu_idle+0x2e/0x3c
Nov 26 06:04:08 backup-in kernel:  [start_kernel+356/383] start_kernel+0x164/0x17f
Nov 26 06:04:08 backup-in kernel:  [unknown_bootoption+0/374] unknown_bootoption+0x0/0x176
Nov 26 06:04:08 backup-in kernel: klogd: page allocation failure. order:3, mode:0x20
Nov 26 06:04:08 backup-in kernel: Stack pointer is garbage, not printing trace
Nov 26 06:04:08 backup-in kernel: klogd: page allocation failure. order:3, mode:0x20

... and it goes on like that for hours...

-- 
Ralf Hildebrandt (i.A. des IT-Zentrum)          Ralf.Hildebrandt@charite.de
Charite - Universitätsmedizin Berlin            Tel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-Berlin    Fax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF                 send no mail to spamtrap@charite.de

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Out of memory, but no OOM Killer? (2.6.9-ac11)
  2004-11-26 22:47 Out of memory, but no OOM Killer? (2.6.9-ac11) Ralf Hildebrandt
@ 2004-11-26 23:56 ` Nick Piggin
  2004-11-27  0:33   ` Ralf Hildebrandt
  2004-11-28 17:29 ` Alan Cox
  1 sibling, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2004-11-26 23:56 UTC (permalink / raw)
  To: Ralf Hildebrandt; +Cc: linux-kernel

Ralf Hildebrandt wrote:
> rsync seems to want lots of memory, yet the OOM killer doesn't strike.
> Subsequently, that machine died an ugly death until delivered by a
> power-cycle.
> 
> Why doesn't the OOM killer reap rsync?

This could be the problem where fragmented memory causes atomic higher
order allocations to fail, for which there is a fix in -mm, which should
make its way into 2.6.11.

Also, the increased atomic memory reserves in current 2.6-bk should
alleviate the problem.

As a temporary workaround, you can increase /proc/sys/vm/min_free_kbytes

BTW. what does `free` say when the allocation failures are happening?
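[A sketch of that workaround, for reference. The /proc path is the 2.6-era sysctl interface; the 10x bump below is purely illustrative (it happens to match what gets tried later in the thread), not a tuned recommendation.]

```shell
#!/bin/sh
# Illustrative only: compute a larger atomic-reserve value from the
# default this box reported. Writing it back requires root, so the
# actual write is shown as a comment.
CURRENT=724                  # what `cat /proc/sys/vm/min_free_kbytes` showed here
SUGGESTED=$((CURRENT * 10))  # arbitrary 10x bump for testing, not a formula
echo "raising min_free_kbytes from $CURRENT to $SUGGESTED"
# As root:
#   echo $SUGGESTED > /proc/sys/vm/min_free_kbytes
```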


* Re: Out of memory, but no OOM Killer? (2.6.9-ac11)
  2004-11-26 23:56 ` Nick Piggin
@ 2004-11-27  0:33   ` Ralf Hildebrandt
  2004-11-27  1:10     ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Ralf Hildebrandt @ 2004-11-27  0:33 UTC (permalink / raw)
  To: linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au>:

> This could be the problem where fragmented memory causes atomic higher
> order allocations to fail, for which there is a fix in -mm, which should
> make its way into 2.6.11.

I see. So rsync requested a big chunk of memory, but the allocation failed
due to fragmentation of the free memory? My "sar" output shows lots of free
memory and lots of unused swap:

03:55:01    kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
04:55:01         2852    512048     99.45      2340    466792   2008084         0      0.00         0
05:05:02         2328    512572     99.55       500    467884   2008084         0      0.00         0
05:15:02         2008    512892     99.61       536    468664   2008084         0      0.00         0
05:25:01        31152    483748     93.95      1804    436728   2008084         0      0.00         0
05:35:01         4056    510844     99.21      5668    452856   2008084         0      0.00         0
05:45:01       129908    384992     74.77       836    305232   2008084         0      0.00         0
05:55:02         2376    512524     99.54       444    464200   2008084         0      0.00         0
06:05:02         1952    512948     99.62       608    465736   2008084         0      0.00         0
06:15:02         1576    513324     99.69       424    465580   2008084         0      0.00         0

for reference:
Nov 26 05:58:19 backup-in -- MARK --
Nov 26 06:02:04 backup-in kernel: rsync: page allocation failure. order:3, mode:0x20

> Also, the increased atomic memory reserves in current 2.6-bk should
> alleviate the problem.

2.4.27 doesn't seem to expose the problem either.
 
> As a temporary workaround, you can increase /proc/sys/vm/min_free_kbytes

# cat /proc/sys/vm/min_free_kbytes
724

I increased that to 7240 now.

> BTW. what does `free` say when the allocation failures are happening?

see sar output above.

-- 
Ralf Hildebrandt (i.A. des IT-Zentrum)          Ralf.Hildebrandt@charite.de
Charite - Universitätsmedizin Berlin            Tel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-Berlin    Fax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF                 send no mail to spamtrap@charite.de


* Re: Out of memory, but no OOM Killer? (2.6.9-ac11)
  2004-11-27  0:33   ` Ralf Hildebrandt
@ 2004-11-27  1:10     ` Nick Piggin
  2004-11-27  8:03       ` Ralf Hildebrandt
  0 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2004-11-27  1:10 UTC (permalink / raw)
  To: Ralf Hildebrandt; +Cc: linux-kernel

Ralf Hildebrandt wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au>:
> 
> 
>>This could be the problem where fragmented memory causes atomic higher
>>order allocations to fail, for which there is a fix in -mm, which should
>>make its way into 2.6.11.
> 
> 
> I see. rsync requested a big chunk of memory, but failed due to the
> fragmentation of free memory? my "sar" output shows lots of free memory and
> lots of unused swap:
> 

Basically, yes. Well not *exactly* rsync - your network drivers. I guess
rsync is showing up in process context most often because that is the
process causing most of the network activity.

Yep, it looks like fragmentation is indeed the problem here. You can see
you have a lot of memory that could be reclaimed, but the failing
allocations themselves can't reclaim any of it, because they are happening
from interrupts. What they should be doing is telling `kswapd` to start
freeing memory for them - however, this currently doesn't happen properly
for allocations of order greater than 0.
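[For readers puzzled by failures while `sar` shows free memory, the arithmetic behind an "order:3" request is worth spelling out. A sketch; the 4 KB page size is an assumption about this i386 box.]

```shell
#!/bin/sh
# An "order:N" allocation needs 2^N physically *contiguous* pages.
# Fragmented free memory can contain no such run even when the total
# amount of free memory is large.
PAGE_KB=4                      # i386 page size, in KB (assumed)
ORDER=3                        # the order from the failure messages above
PAGES=$((1 << ORDER))          # 8 contiguous pages
BLOCK_KB=$((PAGES * PAGE_KB))  # one 32 KB physically contiguous block
echo "order:$ORDER needs $PAGES contiguous pages = $BLOCK_KB KB"
# On a running 2.6 kernel, /proc/buddyinfo lists the count of free
# blocks of each order per zone, which makes this starvation visible.
```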

Fortunately that is usually not a big problem, but as you have seen, it
can be. Anyway, expect 2.6.10 to be better (ie. good enough), and 2.6.11
should have even more complete fixes.

> 
>>As a temporary workaround, you can increase /proc/sys/vm/min_free_kbytes
> 
> 
> # cat /proc/sys/vm/min_free_kbytes
> 724
> 
> I increased that to 7240 now.
> 

OK that should be fine. If you should upgrade to a 2.6.10 or later kernel,
put this value back to the default, and report further problems if they
occur.

> 
>>BTW. what does `free` say when the allocation failures are happening?
> 
> 
> see sar output above.
> 

Thanks Ralf.


* Re: Out of memory, but no OOM Killer? (2.6.9-ac11)
  2004-11-27  1:10     ` Nick Piggin
@ 2004-11-27  8:03       ` Ralf Hildebrandt
  0 siblings, 0 replies; 6+ messages in thread
From: Ralf Hildebrandt @ 2004-11-27  8:03 UTC (permalink / raw)
  To: linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au>:

> >I see. rsync requested a big chunk of memory, but failed due to the
> >fragmentation of free memory? my "sar" output shows lots of free memory and
> >lots of unused swap:
> >
> 
> Basically, yes. Well not *exactly* rsync - your network drivers. I guess
> rsync is showing up in process context most often because that is the
> process causing most of the network activity.

At that time, yes.

> Yep, it looks like fragmentation is indeed the problem here. See you have
> a lot of memory that is able to be reclaimed, but the failing allocations
> themselves can't reclaim any of it because they are happening from
> interrupts. What they should be doing is telling `kswapd` to start freeing
> memory for them - however this currently doesn't happen properly for
> allocations which are order greater than 0.
> 
> Fortunately that is usually not a big problem, but as you have seen, it
> can be. Anyway, expect 2.6.10 to be better (ie. good enough), and 2.6.11
> should have even more complete fixes.

Aha.

> OK that should be fine. If you should upgrade to a 2.6.10 or later kernel,
> put this value back to the default, and report further problems if they
> occur.

I just set it "for now".

-- 
Ralf Hildebrandt (i.A. des IT-Zentrum)          Ralf.Hildebrandt@charite.de
Charite - Universitätsmedizin Berlin            Tel.  +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-Berlin    Fax.  +49 (0)30-450 570-962
IT-Zentrum Standort CBF                 send no mail to spamtrap@charite.de


* Re: Out of memory, but no OOM Killer? (2.6.9-ac11)
  2004-11-26 22:47 Out of memory, but no OOM Killer? (2.6.9-ac11) Ralf Hildebrandt
  2004-11-26 23:56 ` Nick Piggin
@ 2004-11-28 17:29 ` Alan Cox
  1 sibling, 0 replies; 6+ messages in thread
From: Alan Cox @ 2004-11-28 17:29 UTC (permalink / raw)
  To: Ralf Hildebrandt; +Cc: Linux Kernel Mailing List

On Gwe, 2004-11-26 at 22:47, Ralf Hildebrandt wrote:
> rsync seems to want lots of memory, yet the OOM killer doesn't strike.
> Subsequently, that machine died an ugly death until delivered by a
> power-cycle.
> 
> Why doesn't the OOM killer reap rsync?

The OOM killer is a very dumb (near useless) heuristic. You can turn it
off or switch to a sane overcommit mode by setting
/proc/sys/vm/overcommit_memory to 1 or 2 respectively.
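[A sketch of the switch being described, using the 2.6 sysctl interface. The mode summaries are paraphrased from this reply; writing the sysctl needs root, so the write is shown as a comment only.]

```shell
#!/bin/sh
# vm.overcommit_memory policies:
#   0 = heuristic overcommit (default - the behaviour complained about here)
#   1 = always overcommit, never refuse an allocation up front
#   2 = strict accounting ("sane" mode: malloc fails early instead of
#       the OOM killer picking a victim later)
MODE=2
echo "would set vm.overcommit_memory=$MODE"
# As root:
#   echo $MODE > /proc/sys/vm/overcommit_memory
```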

In this case your bug looks like the 2.6.9 network problem; if you are
lucky, -ac12 will have fixed it, as I merged the stuff DaveM recommended
into that release.

Alan



end of thread, other threads:[~2004-11-28 18:33 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-11-26 22:47 Out of memory, but no OOM Killer? (2.6.9-ac11) Ralf Hildebrandt
2004-11-26 23:56 ` Nick Piggin
2004-11-27  0:33   ` Ralf Hildebrandt
2004-11-27  1:10     ` Nick Piggin
2004-11-27  8:03       ` Ralf Hildebrandt
2004-11-28 17:29 ` Alan Cox
