[RFC] Reproducible OOM with partial workaround

* [RFC] Reproducible OOM with partial workaround
@ 2013-01-10 21:58 ` paul.szabo
  0 siblings, 0 replies; 22+ messages in thread
From: paul.szabo @ 2013-01-10 21:58 UTC (permalink / raw)
  To: linux-mm; +Cc: 695182, linux-kernel

Dear Linux-MM,

On a machine with an i386 kernel and over 32GB RAM, an OOM condition is
reliably obtained simply by writing a few files to some local disk,
e.g. with:
  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; ((n=$n+1)); done
The crash usually occurs after 16 or 32 files have been written. It
seems that the problem may be avoided by booting with mem=32G on the
kernel command line, and that it occurs with any amount of RAM over 32GB.
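
As an illustration (not part of the original message), lowmem can be
watched while the dd loop runs. The sketch below assumes a 32-bit
highmem kernel, where /proc/meminfo carries a LowFree line; the function
name low_free_kb is hypothetical.

```shell
# Print the LowFree value (in kB) from a meminfo-style file.
# Assumption: LowTotal/LowFree lines are present, which is only the case
# on 32-bit kernels built with highmem support.
# $1: file to read, defaults to /proc/meminfo
low_free_kb() {
    awk '$1 == "LowFree:" { print $2 }' "${1:-/proc/meminfo}"
}
```

In a second terminal, something like
  while sleep 1; do echo "$(date +%T) LowFree=$(low_free_kb) kB"; done
would show whether LowFree shrinks as files are written.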

I developed a workaround patch for this particular OOM demo, dropping
filesystem caches when about to exhaust lowmem. However, subsequently
I observed OOM when running many processes (as yet I do not have an
easy-to-reproduce demo of this); so as I suspected, the essence of the
problem is not with FS caches.

Could you please help in finding the cause of this OOM bug?

Please see
http://bugs.debian.org/695182
for details, in particular my workaround patch
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182

(Please reply to me directly, as I am not a subscriber to the linux-mm
mailing list.)

Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-10 21:58 ` paul.szabo
@ 2013-01-10 23:12   ` Dave Hansen
  -1 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2013-01-10 23:12 UTC (permalink / raw)
  To: paul.szabo; +Cc: linux-mm, 695182, linux-kernel

On 01/10/2013 01:58 PM, paul.szabo@sydney.edu.au wrote:
> I developed a workaround patch for this particular OOM demo, dropping
> filesystem caches when about to exhaust lowmem. However, subsequently
> I observed OOM when running many processes (as yet I do not have an
> easy-to-reproduce demo of this); so as I suspected, the essence of the
> problem is not with FS caches.
> 
> Could you please help in finding the cause of this OOM bug?

As was mentioned in the bug, your 32GB of physical memory only ends up
giving ~900MB of low memory to the kernel.  Of that, around 600MB is
used for "mem_map[]", leaving only about 300MB available to the kernel
for *ALL* of its allocations at runtime.
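
The arithmetic behind the mem_map[] overhead can be sketched as follows
(not part of the original message). It assumes 4 KiB pages and takes
sizeof(struct page) as a parameter, since that size depends on kernel
version and configuration; the 32 and 36 byte values are illustrative
assumptions only, and mem_map_mib is a hypothetical helper name.

```shell
# Estimate the size of mem_map[] in MiB: one struct page per 4 KiB page.
# $1: RAM in GiB
# $2: assumed sizeof(struct page) in bytes (an assumption, config-dependent)
mem_map_mib() {
    pages=$(( $1 * 1024 * 1024 / 4 ))      # number of 4 KiB pages
    echo $(( pages * $2 / 1024 / 1024 ))
}

mem_map_mib 32 32    # prints 256
mem_map_mib 64 36    # prints 576
```

Whatever the exact struct size, the overhead grows linearly with
installed RAM while lowmem stays fixed, so every added gigabyte leaves
less lowmem for everything else.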

Your configuration has never worked.  This isn't a regression, it's
simply something that we know never worked in Linux and it's a very hard
problem to solve.  One Linux vendor (at least) went to a huge amount of
trouble to develop, ship, and support a kernel that supported large
32-bit machines, but it was never merged upstream and work stopped on it
when such machines became rare beasts:

	http://lwn.net/Articles/39925/

I believe just about any Linux vendor would call your configuration
"unsupported".  Just because the kernel can boot does not mean that we
expect it to work.

It's possible that some tweaks of the vm knobs (like lowmem_reserve)
could help you here.  But, really, you don't want to run a 32-bit kernel
on such a large machine.  Very, very few folks are running 32-bit
kernels on these systems and you're likely to keep running into bugs
because this is such a rare configuration.

We've been very careful to ensure that 64-bit kernels should basically be
drop-in replacements for 32-bit ones.  You can keep userspace 100%
32-bit, and just have a 64-bit kernel.

If you're really set on staying 32-bit, I might have a NUMA-Q I can give
you. ;)

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-10 23:12   ` Dave Hansen
@ 2013-01-11  0:46     ` paul.szabo
  -1 siblings, 0 replies; 22+ messages in thread
From: paul.szabo @ 2013-01-11  0:46 UTC (permalink / raw)
  To: dave; +Cc: 695182, linux-kernel, linux-mm

Dear Dave,

> Your configuration has never worked.  This isn't a regression ...
> ... does not mean that we expect it to work.

Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
that all development is for 64-bit only?

> ... 64-bit kernels should basically be drop-in replacements ...

Will think about that. I know all my servers are 64-bit capable, will
need to check all my desktops.
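
One way to check an x86 machine for 64-bit capability (a sketch, not
part of the original message) is to look for the "lm" (long mode) CPU
flag in /proc/cpuinfo. The function name cpu_is_64bit_capable is
hypothetical.

```shell
# Succeed if the cpuinfo-style file lists the x86 "lm" (long mode) flag.
# grep -w matches whole words only, so flags such as lahf_lm do not count.
# $1: file to read, defaults to /proc/cpuinfo
cpu_is_64bit_capable() {
    grep -E '^flags[[:space:]]*:' "${1:-/proc/cpuinfo}" | grep -qw lm
}
```

Usage:
  cpu_is_64bit_capable && echo "64-bit capable" || echo "32-bit only"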

---

I find it puzzling that there seems to be a sharp cutoff at 32GB RAM,
no problem under but OOM just over; whereas I would have expected
lowmem starvation to be gradual, with OOM occurring much sooner with
64GB than with 34GB. Also, the kernel seems capable of reclaiming
lowmem, so I wonder why that fails just over the 32GB threshold.
(Obviously I have no idea what I am talking about.)

---

Thanks, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-11  0:46     ` paul.szabo
@ 2013-01-11  1:26       ` Dave Hansen
  -1 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2013-01-11  1:26 UTC (permalink / raw)
  To: paul.szabo; +Cc: 695182, linux-kernel, linux-mm

On 01/10/2013 04:46 PM, paul.szabo@sydney.edu.au wrote:
>> Your configuration has never worked.  This isn't a regression ...
>> ... does not mean that we expect it to work.
> 
> Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
> that all development is for 64-bit only?

My last 4GB laptop had a 1GB hole and needed HIGHMEM64G since it had RAM
at 0->5GB.  That worked just fine, btw.  The problem isn't with
HIGHMEM64G itself.

I'm not saying HIGHMEM64G is inherently bad, just that it gets gradually
worse and worse as you add more RAM.  I don't believe 64GB of RAM has
_ever_ been booted on a 32-bit kernel without either violating the ABI
(3GB/1GB split) or doing something that never got merged upstream (that
4GB/4GB split, or other fun stuff like page clustering).

> I find it puzzling that there seems to be a sharp cutoff at 32GB RAM,
> no problem under but OOM just over; whereas I would have expected
> lowmem starvation to be gradual, with OOM occurring much sooner with
> 64GB than with 34GB. Also, the kernel seems capable of reclaiming
> lowmem, so I wonder why that fails just over the 32GB threshold.
> (Obviously I have no idea what I am talking about.)

It _is_ puzzling.  It isn't immediately obvious to me why the slab that
you have isn't being reclaimed.  There might, indeed, be a fixable bug
there.  But, there are probably a bunch more bugs which will keep you
from having a nice, smoothly-running system; mostly those bugs have not
had much attention in the 10 years or so since 64-bit x86 became
commonplace.  Plus, even 10 years ago, when folks were working on this
actively, we _never_ got things running smoothly on 32GB of RAM.  Take a
look at this:

http://support.bull.com/ols/product/system/linux/redhat/help/kbf/g/inst/PrKB11417

You are effectively running the "SMP kernel" (hugemem is a completely
different beast).

I had a 32GB i386 system.  It was a really, really fun system to play
with, and its never-ending list of bugs helped keep me employed for
several years.  You don't want to unnecessarily inflict that pain on
yourself, really.

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-11  1:26       ` Dave Hansen
@ 2013-01-11  1:46         ` paul.szabo
  -1 siblings, 0 replies; 22+ messages in thread
From: paul.szabo @ 2013-01-11  1:46 UTC (permalink / raw)
  To: dave; +Cc: 695182, linux-kernel, linux-mm

Dear Dave,

> ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> kernel without either violating the ABI (3GB/1GB split) or doing
> something that never got merged upstream ...

Sorry to be so contradictory:

psz@como:~$ uname -a
Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
psz@como:~$ free -l
             total       used       free     shared    buffers     cached
Mem:      64446900    4729292   59717608          0      15972     480520
Low:        375836     304400      71436
High:     64071064    4424892   59646172
-/+ buffers/cache:    4232800   60214100
Swap:    134217724          0  134217724
psz@como:~$ 

(though I would not know about violations).

But OK, I take your point that I should move with the times.

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-11  1:46         ` paul.szabo
@ 2013-01-11  8:01           ` Andrew Morton
  -1 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2013-01-11  8:01 UTC (permalink / raw)
  To: paul.szabo; +Cc: dave, 695182, linux-kernel, linux-mm

On Fri, 11 Jan 2013 12:46:15 +1100 paul.szabo@sydney.edu.au wrote:

> > ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> > kernel without either violating the ABI (3GB/1GB split) or doing
> > something that never got merged upstream ...
> 
> Sorry to be so contradictory:
> 
> psz@como:~$ uname -a
> Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
> psz@como:~$ free -l
>              total       used       free     shared    buffers     cached
> Mem:      64446900    4729292   59717608          0      15972     480520
> Low:        375836     304400      71436
> High:     64071064    4424892   59646172
> -/+ buffers/cache:    4232800   60214100
> Swap:    134217724          0  134217724
> psz@como:~$ 
> 
> (though I would not know about violations).
> 
> But OK, I take your point that I should move with the times.

Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.

If so, you *may* be able to work around this by setting
/proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
amount of dirty pagecache around.  Then, with luck, if we haven't
broken the buffer_heads_over_limit logic in the past decade (we
probably have), the VM should be able to reclaim those buffer_heads.

Alternatively, use a filesystem which doesn't attach buffer_heads to
dirty pages.  xfs or btrfs, perhaps.
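
The slabinfo check can be sketched as below (not part of the original
message). It ranks caches by num_objs * objsize, which is only an
approximation because it ignores per-slab overhead; the function name
slab_usage_kb is hypothetical.

```shell
# List slab caches by approximate memory use in kB, largest first.
# Uses columns 3 (num_objs) and 4 (objsize) of the slabinfo 2.x format
# and skips the two header lines.
# $1: file to read, defaults to /proc/slabinfo (typically readable by root only)
slab_usage_kb() {
    awk '!/^(slabinfo|#)/ { printf "%d %s\n", $3 * $4 / 1024, $1 }' \
        "${1:-/proc/slabinfo}" | sort -rn
}
```

Usage:
  slab_usage_kb | head              # biggest caches
  slab_usage_kb | grep buffer_head  # the cache in question
The current dirty_ratio can be read with "sysctl vm.dirty_ratio" and
lowered with, for example, "sysctl -w vm.dirty_ratio=1" (as root);
whether that actually helps in this configuration is not established.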

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-11  8:01           ` Andrew Morton
@ 2013-01-11  8:30             ` Simon Jeons
  -1 siblings, 0 replies; 22+ messages in thread
From: Simon Jeons @ 2013-01-11  8:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: paul.szabo, dave, 695182, linux-kernel, linux-mm

On Fri, 2013-01-11 at 00:01 -0800, Andrew Morton wrote:
> On Fri, 11 Jan 2013 12:46:15 +1100 paul.szabo@sydney.edu.au wrote:
> 
> > > ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
> > > kernel without either violating the ABI (3GB/1GB split) or doing
> > > something that never got merged upstream ...
> > 
> > Sorry to be so contradictory:
> > 
> > psz@como:~$ uname -a
> > Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
> > psz@como:~$ free -l
> >              total       used       free     shared    buffers     cached
> > Mem:      64446900    4729292   59717608          0      15972     480520
> > Low:        375836     304400      71436
> > High:     64071064    4424892   59646172
> > -/+ buffers/cache:    4232800   60214100
> > Swap:    134217724          0  134217724
> > psz@como:~$ 
> > 
> > (though I would not know about violations).
> > 
> > But OK, I take your point that I should move with the times.
> 
> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
> 
> If so, you *may* be able to work around this by setting
> /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> amount of dirty pagecache around.  Then, with luck, if we haven't
> broken the buffer_heads_over_limit logic in the past decade (we
> probably have), the VM should be able to reclaim those buffer_heads.
> 
> Alternatively, use a filesystem which doesn't attach buffer_heads to
> dirty pages.  xfs or btrfs, perhaps.
> 

Hi Andrew,

What's the meaning of attaching buffer_heads to dirty pages?

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-11  8:01           ` Andrew Morton
@ 2013-01-11 11:51             ` paul.szabo
  -1 siblings, 0 replies; 22+ messages in thread
From: paul.szabo @ 2013-01-11 11:51 UTC (permalink / raw)
  To: akpm; +Cc: 695182, dave, linux-kernel, linux-mm

Dear Andrew,

> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.

Please see below: I do not know what any of that means. This machine has
been running just fine, with all my users logging in here via XDMCP from
X-terminals, dozens logged in simultaneously. (But, I think I could make
it go OOM with more processes or logins.)

> If so, you *may* be able to work around this by setting
> /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> amount of dirty pagecache around.  Then, with luck, if we haven't
> broken the buffer_heads_over_limit logic in the past decade (we
> probably have), the VM should be able to reclaim those buffer_heads.

I tried setting dirty_ratio to "funny" values; that did not seem to
help. Did you notice my patch about bdi_position_ratio(), how it was
plain wrong half the time (for negative x)? Anyway that did not help.

> Alternatively, use a filesystem which doesn't attach buffer_heads to
> dirty pages.  xfs or btrfs, perhaps.

Seems there is also a problem not related to filesystem... or rather,
the essence does not seem to be filesystem or caches. The filesystem
thing now seems OK with my patch doing drop_caches.

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia


---

root@como:~# free -lm
             total       used       free     shared    buffers     cached
Mem:         62936       2317      60618          0         41        635
Low:           367        271         95
High:        62569       2045      60523
-/+ buffers/cache:       1640      61295
Swap:       131071          0     131071
root@como:~# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fuse_request           0      0    376   43    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode             0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                0      0    288   28    2 : tunables    0    0    0 : slabdata      0      0      0
ntfs_big_inode_cache      0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
ntfs_inode_cache       0      0    176   46    2 : tunables    0    0    0 : slabdata      0      0      0
nfs_direct_cache       0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
nfs_inode_cache     5404   5404    584   28    4 : tunables    0    0    0 : slabdata    193    193      0
isofs_inode_cache      0      0    360   45    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache        0      0    408   40    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_revoke_record      0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
journal_handle      5440   5440     24  170    1 : tunables    0    0    0 : slabdata     32     32      0
journal_head       16768  16768     64   64    1 : tunables    0    0    0 : slabdata    262    262      0
revoke_record      20224  20224     16  256    1 : tunables    0    0    0 : slabdata     79     79      0
ext4_inode_cache       0      0    584   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data         0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_allocation_context      0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_prealloc_space      0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_end            0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_page           0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    480   34    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache   16531  19965    488   33    4 : tunables    0    0    0 : slabdata    605    605      0
ext3_xattr             0      0     48   85    1 : tunables    0    0    0 : slabdata      0      0      0
dquot                840    840    192   42    2 : tunables    0    0    0 : slabdata     20     20      0
rpc_inode_cache      144    144    448   36    4 : tunables    0    0    0 : slabdata      4      4      0
UDP-Lite               0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache         0      0    320   51    4 : tunables    0    0    0 : slabdata      0      0      0
UDP                  896    896    576   28    4 : tunables    0    0    0 : slabdata     32     32      0
tw_sock_TCP         1344   1344    128   32    1 : tunables    0    0    0 : slabdata     42     42      0
TCP                 1457   1624   1152   28    8 : tunables    0    0    0 : slabdata     58     58      0
eventpoll_pwq       3264   3264     40  102    1 : tunables    0    0    0 : slabdata     32     32      0
blkdev_queue         330    330    968   33    8 : tunables    0    0    0 : slabdata     10     10      0
blkdev_requests     2368   2368    216   37    2 : tunables    0    0    0 : slabdata     64     64      0
biovec-256           350    350   3072   10    8 : tunables    0    0    0 : slabdata     35     35      0
biovec-128           693    693   1536   21    8 : tunables    0    0    0 : slabdata     33     33      0
biovec-64           1890   1890    768   42    8 : tunables    0    0    0 : slabdata     45     45      0
sock_inode_cache    8206   9408    384   42    4 : tunables    0    0    0 : slabdata    224    224      0
skbuff_fclone_cache   1806   1806    384   42    4 : tunables    0    0    0 : slabdata     43     43      0
file_lock_cache     1692   1692    112   36    1 : tunables    0    0    0 : slabdata     47     47      0
shmem_inode_cache   2244   2244    368   44    4 : tunables    0    0    0 : slabdata     51     51      0
Acpi-State         76245  76245     48   85    1 : tunables    0    0    0 : slabdata    897    897      0
taskstats           1568   1568    328   49    4 : tunables    0    0    0 : slabdata     32     32      0
proc_inode_cache   10736  10736    368   44    4 : tunables    0    0    0 : slabdata    244    244      0
sigqueue            1120   1120    144   28    1 : tunables    0    0    0 : slabdata     40     40      0
bdev_cache           608    608    512   32    4 : tunables    0    0    0 : slabdata     19     19      0
sysfs_dir_cache    36057  36057     80   51    1 : tunables    0    0    0 : slabdata    707    707      0
inode_cache         7584   7584    336   48    4 : tunables    0    0    0 : slabdata    158    158      0
dentry             32995  43584    128   32    1 : tunables    0    0    0 : slabdata   1362   1362      0
buffer_head        83001  83001     56   73    1 : tunables    0    0    0 : slabdata   1137   1137      0
vm_area_struct     51480  83352     88   46    1 : tunables    0    0    0 : slabdata   1812   1812      0
mm_struct           2257   2556    448   36    4 : tunables    0    0    0 : slabdata     71     71      0
signal_cache        3584   3584    576   28    4 : tunables    0    0    0 : slabdata    128    128      0
sighand_cache       2664   2664   1344   24    8 : tunables    0    0    0 : slabdata    111    111      0
task_xstate         8154   8268    832   39    8 : tunables    0    0    0 : slabdata    212    212      0
task_struct         8896   8896   1008   32    8 : tunables    0    0    0 : slabdata    278    278      0
anon_vma_chain     70596  96050     24  170    1 : tunables    0    0    0 : slabdata    565    565      0
anon_vma           52113  62934     40  102    1 : tunables    0    0    0 : slabdata    617    617      0
radix_tree_node    15722  22578    304   53    4 : tunables    0    0    0 : slabdata    426    426      0
idr_layer_cache     9116   9116    152   53    2 : tunables    0    0    0 : slabdata    172    172      0
dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8192         272    272   8192    4    8 : tunables    0    0    0 : slabdata     68     68      0
kmalloc-4096         585    608   4096    8    8 : tunables    0    0    0 : slabdata     76     76      0
kmalloc-2048         714    832   2048   16    8 : tunables    0    0    0 : slabdata     52     52      0
kmalloc-1024        5351   5536   1024   32    8 : tunables    0    0    0 : slabdata    173    173      0
kmalloc-512         7776   8512    512   32    4 : tunables    0    0    0 : slabdata    266    266      0
kmalloc-256         3334   3936    256   32    2 : tunables    0    0    0 : slabdata    123    123      0
kmalloc-128         5375   7744    128   32    1 : tunables    0    0    0 : slabdata    242    242      0
kmalloc-64         28005  35584     64   64    1 : tunables    0    0    0 : slabdata    556    556      0
kmalloc-32         67453  68224     32  128    1 : tunables    0    0    0 : slabdata    533    533      0
kmalloc-16         78772  83968     16  256    1 : tunables    0    0    0 : slabdata    328    328      0
kmalloc-8          70656  70656      8  512    1 : tunables    0    0    0 : slabdata    138    138      0
kmalloc-192        38594  64050    192   42    2 : tunables    0    0    0 : slabdata   1525   1525      0
kmalloc-96         21630  21630     96   42    1 : tunables    0    0    0 : slabdata    515    515      0
kmem_cache            32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
kmem_cache_node      512    512     32  128    1 : tunables    0    0    0 : slabdata      4      4      0
root@como:~# 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-11 11:51             ` paul.szabo
@ 2013-01-11 20:31               ` Andrew Morton
  -1 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2013-01-11 20:31 UTC (permalink / raw)
  To: paul.szabo; +Cc: 695182, dave, linux-kernel, linux-mm

On Fri, 11 Jan 2013 22:51:35 +1100
paul.szabo@sydney.edu.au wrote:

> Dear Andrew,
> 
> > Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
> 
> Please see below: I do not know what any of that means. This machine has
> been running just fine, with all my users logging in here via XDMCP from
> X-terminals, dozens logged in simultaneously. (But, I think I could make
> it go OOM with more processes or logins.)

I'm counting 107MB in slab there.  Was this dump taken when the system
was at or near oom?

Please send a copy of the oom-killer kernel message dump, if you still
have one.
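
For reference, the per-cache footprint can be estimated from the slabinfo columns as num_slabs * pagesperslab * page size. The sketch below is only an illustration, not a tool used in this thread: the function name slab_bytes and the three sample rows (copied from the dump above) are chosen for the example, and it assumes 4 KiB pages as on i386.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on i386

# Three rows copied from the /proc/slabinfo dump quoted in this thread.
SAMPLE = """\
ext3_inode_cache   16531  19965    488   33    4 : tunables    0    0    0 : slabdata    605    605      0
dentry             32995  43584    128   32    1 : tunables    0    0    0 : slabdata   1362   1362      0
buffer_head        83001  83001     56   73    1 : tunables    0    0    0 : slabdata   1137   1137      0
"""


def slab_bytes(slabinfo_text, page_size=PAGE_SIZE):
    """Return {cache name: bytes held in slabs} for slabinfo 2.1 text."""
    usage = {}
    for line in slabinfo_text.splitlines():
        # Skip blank lines and the two header lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        head, _, tail = line.partition(": tunables")
        fields = head.split()
        slabdata = tail.partition(": slabdata")[2].split()
        if len(fields) < 6 or len(slabdata) < 2:
            continue  # not a data row in the expected layout
        pages_per_slab = int(fields[5])   # <pagesperslab>
        num_slabs = int(slabdata[1])      # <num_slabs>
        usage[fields[0]] = num_slabs * pages_per_slab * page_size
    return usage


if __name__ == "__main__":
    usage = slab_bytes(SAMPLE)
    # Largest consumers first, reported in MiB.
    for name, size in sorted(usage.items(), key=lambda kv: -kv[1]):
        print(f"{name:20s} {size / (1 << 20):8.2f} MiB")
```

Fed the full dump, the same function gives one figure per cache, so the buffer_head share can be compared directly with the "Low:" total reported by free -l.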

> > If so, you *may* be able to work around this by setting
> > /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
> > amount of dirty pagecache around.  Then, with luck, if we haven't
> > broken the buffer_heads_over_limit logic in the past decade (we
> > probably have), the VM should be able to reclaim those buffer_heads.
> 
> I tried setting dirty_ratio to "funny" values, that did not seem to
> help.

Did you try setting it as low as possible?

> Did you notice my patch about bdi_position_ratio(), how it was
> plain wrong half the time (for negative x)? 

Nope, please resend.

> Anyway that did not help.
> 
> > Alternatively, use a filesystem which doesn't attach buffer_heads to
> > dirty pages.  xfs or btrfs, perhaps.
> 
> Seems there is also a problem not related to filesystem... or rather,
> the essence does not seem to be filesystem or caches. The filesystem
> thing now seems OK with my patch doing drop_caches.

hm, if doing a regular drop_caches fixes things then that implies the
problem is not with dirty pagecache.  Odd.
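
The reasoning rests on what drop_caches does: writing 1 to /proc/sys/vm/drop_caches frees clean pagecache, 2 frees reclaimable slab objects such as dentries and inodes, and 3 frees both, while dirty pages are left alone until they have been written back. The sketch below shows the operation under those assumptions. The function name and its parameters are hypothetical, it needs root when pointed at the real file, and it is not the workaround patch discussed in this thread.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_caches(mode=3, path=DROP_CACHES, sync_first=False):
    """Ask the kernel to drop clean caches.

    mode 1: pagecache, 2: dentries and inodes, 3: both.
    Dirty pages are not freed by this; pass sync_first=True to write
    them back first so that they become clean and droppable.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    if sync_first:
        os.sync()  # flush dirty data to disk before dropping
    with open(path, "w") as f:
        f.write(f"{mode}\n")
    return mode


if __name__ == "__main__":
    # Without a prior sync only already-clean cache is released, which is
    # the case the paragraph above reasons about.
    drop_caches(3)
```

If lowmem recovers after a call without sync_first, then whatever was freed was already clean, which is the basis for the remark that dirty pagecache is unlikely to be the cause.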

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-11 20:31               ` Andrew Morton
@ 2013-01-12  3:24                 ` paul.szabo
  -1 siblings, 0 replies; 22+ messages in thread
From: paul.szabo @ 2013-01-12  3:24 UTC (permalink / raw)
  To: akpm; +Cc: 695182, dave, linux-kernel, linux-mm

Dear Andrew,

>>> Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
>> Please see below ...
> ... Was this dump taken when the system was at or near oom?

No, that was a "quiescent" machine. Please see a just-before-OOM dump in
my next message (in a little while).

> Please send a copy of the oom-killer kernel message dump, if you still
> have one.

Please see one in my next message, or in
http://bugs.debian.org/695182

>> I tried setting dirty_ratio to "funny" values, that did not seem to
>> help.
> Did you try setting it as low as possible?

Probably. Maybe. Sorry, cannot say with certainty.

>> Did you notice my patch about bdi_position_ratio(), how it was
>> plain wrong half the time (for negative x)? 
> Nope, please resend.

Quoting from
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182
:
...
 - In bdi_position_ratio() get difference (setpoint-dirty) right even
   when it is negative, which happens often. Normally these numbers are
   "small" and even with left-shift I never observed a 32-bit overflow.
   I believe it should be possible to re-write the whole function in
   32-bit ints; maybe it is not worth the effort to make it "efficient";
   seeing how this function was always wrong and we survived, it should
   simply be removed.
...
--- mm/page-writeback.c.old	2012-10-17 13:50:15.000000000 +1100
+++ mm/page-writeback.c	2013-01-06 21:54:59.000000000 +1100
[ Line numbers out because other patches not shown ]
...
@@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
 	 *     => fast response on large errors; small oscillation near setpoint
 	 */
 	setpoint = (freerun + limit) / 2;
-	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
 		    limit - setpoint + 1);
 	pos_ratio = x;
 	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
...
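
Why the cast matters on i386 can be illustrated with a small model. The sketch below is not kernel code: it imitates C integer arithmetic in Python, assuming that unsigned long is 32 bits wide on i386 and 64 bits on x86-64, and that RATELIMIT_CALC_SHIFT is 10 as in mm/page-writeback.c. The helper names and the sample values 1000 and 1500 are made up for the example.

```python
RATELIMIT_CALC_SHIFT = 10  # value assumed from mm/page-writeback.c


def to_s64(value):
    """Reinterpret the low 64 bits of value as a signed 64-bit integer."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def numerator_unpatched(setpoint, dirty, long_bits):
    """Model (setpoint - dirty) << SHIFT done in unsigned long, then passed as s64."""
    mask = (1 << long_bits) - 1
    diff = (setpoint - dirty) & mask                 # unsigned subtraction wraps
    shifted = (diff << RATELIMIT_CALC_SHIFT) & mask  # shift stays in unsigned long
    # A 32-bit unsigned value always fits in s64, so it stays positive.
    return to_s64(shifted)


def numerator_patched(setpoint, dirty):
    """Model ((s64)setpoint - (s64)dirty) << SHIFT."""
    return to_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT)


if __name__ == "__main__":
    setpoint, dirty = 1000, 1500  # dirty above setpoint: x should be negative
    print("i386, unpatched:  ", numerator_unpatched(setpoint, dirty, 32))
    print("x86-64, unpatched:", numerator_unpatched(setpoint, dirty, 64))
    print("patched:          ", numerator_patched(setpoint, dirty))
```

In this model the 64-bit case comes out negative even without the cast, because the wrapped value is reinterpreted as s64, whereas the 32-bit case yields a large positive numerator. That is consistent with the report that the result was wrong whenever dirty exceeded setpoint on this machine.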

Cheers, Paul

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Reproducible OOM with partial workaround
  2013-01-11  1:46         ` paul.szabo
@ 2013-01-11 16:04           ` Dave Hansen
  -1 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2013-01-11 16:04 UTC (permalink / raw)
  To: paul.szabo; +Cc: 695182, linux-kernel, linux-mm

On 01/10/2013 05:46 PM, paul.szabo@sydney.edu.au wrote:
>> > ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
>> > kernel without either violating the ABI (3GB/1GB split) or doing
>> > something that never got merged upstream ...
> Sorry to be so contradictory:
> 
> psz@como:~$ uname -a
> Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
> psz@como:~$ free -l
>              total       used       free     shared    buffers     cached
> Mem:      64446900    4729292   59717608          0      15972     480520
> Low:        375836     304400      71436
> High:     64071064    4424892   59646172
> -/+ buffers/cache:    4232800   60214100
> Swap:    134217724          0  134217724

Hey, that's pretty cool!  I would swear that the mem_map[] overhead was
such that they wouldn't boot, but perhaps those brain cells died on me.
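
The mem_map[] concern can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only. It assumes 4 KiB pages, a struct page of 32 bytes (the real size depends on kernel version and config options), and about 896 MiB of lowmem under the default 3GB/1GB split. None of these figures were measured on the machine in this thread.

```python
GIB = 1 << 30
MIB = 1 << 20

PAGE_SIZE = 4096          # assumption: 4 KiB pages
STRUCT_PAGE_SIZE = 32     # assumption: bytes per struct page on 32-bit
LOWMEM = 896 * MIB        # assumption: lowmem with the default 3GB/1GB split


def mem_map_bytes(ram_bytes, page_size=PAGE_SIZE,
                  struct_page_size=STRUCT_PAGE_SIZE):
    """Bytes needed for one struct page per physical page frame."""
    return (ram_bytes // page_size) * struct_page_size


if __name__ == "__main__":
    overhead = mem_map_bytes(64 * GIB)
    print(f"mem_map for 64 GiB: {overhead // MIB} MiB")
    print(f"lowmem left over:   {(LOWMEM - overhead) // MIB} MiB")
```

Under these assumptions mem_map[] takes more than half of lowmem but still fits, and the remainder is in the same range as the "Low:" total of 375836 kB in the free -l output quoted above.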



^ permalink raw reply	[flat|nested] 22+ messages in thread
