VM Problems in 2.6.7 (Too active OOM Killer)

public inbox for linux-kernel@vger.kernel.org

* VM Problems in 2.6.7 (Too active OOM Killer)
@ 2004-07-14  2:23 Peter Zaitsev
  2004-07-14  2:40 ` William Lee Irwin III
  2004-07-14  3:17 ` Andrea Arcangeli
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Zaitsev @ 2004-07-14  2:23 UTC (permalink / raw)
  To: linux-kernel

Hi,

To be honest, I was truly surprised to see the OOM killer killing MySQL
without any good reason during a highly IO-intensive test:

Out of Memory: Killed process 19301 (mysqld).
Out of Memory: Killed process 19302 (mysqld).
Out of Memory: Killed process 19303 (mysqld).
Out of Memory: Killed process 19304 (mysqld).
Out of Memory: Killed process 19305 (mysqld).
Out of Memory: Killed process 19306 (mysqld).
Out of Memory: Killed process 19309 (mysqld).
Out of Memory: Killed process 19310 (mysqld).
Out of Memory: Killed process 19311 (mysqld).
Out of Memory: Killed process 19312 (mysqld).
Out of Memory: Killed process 19737 (mysqld).
Out of Memory: Killed process 19739 (mysqld).
Out of Memory: Killed process 19821 (mysqld).


This box has 4G of memory and is running without swap (what would I need
it for if I can only use up to 3GB of address space in the application
anyway?)


Here is how the vmstat output looked:

 0  4      0   7028  43436 1656020    0    0  3988  8752 1716 11803  8  5 45 43
 0  9      0   7004  42520 1654692    0    0  4372  8642 1735 12803  8  5 41 46
 2  3      0   7828  40784 1654252    0    0  4024  7838 1662 11486  7  4 44 45
 5 13      0   7228  38652 1653800    0    0  4370  9087 1751 12864  9  5 40 47
 0  2      0   5928  32976 1645808    0    0  4954  8866 1890 13352  9  5 39 47
 0 15      0   6560  22052 1642996    0    0  5111  7699 2004 11819  8  5 41 47
 0  2      0   6232  15760 1642624    0    0  4630  6841 1912 10315  6  4 46 45
 1  6      0   7804  10912 1640332    0    0  4493  6446 1913  9362  6  4 48 43
 0  5      0 391660   6080 1267356    0    0  4265  6404 1902  9483  6  4 46 44
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  2      0 330392   5696 1319216    0    0  4674  6309 1949  9792  6  4 46 44
 0  2      0 278052   3520 1363212    0    0  4155  5177 1827  7459  4  6 48 42
 1  0      0 2188840   5676 1387984    0    0  2176  2104 1410 21275  5 33 41 21
 0  1      0 2121864  16948 1421456    0    0  1110  3600 1317 85261 11 23 58  8
 1  0      0 2062688  27072 1454988    0    0   997  3612 1302 84842 11 22 58  9
 0  1      0 1997464  38160 1488236    0    0  1096  3685 1319 85041 11 23 59  8
 0  1      0 1930536  52184 1521540    0    0  1391  4318 1476 84995 11 23 58  9
 0  1      0 1841888  69896 1555032    0    0  1758  3630 1478 84814 11 23 58  8
 0  1      0 1758272  85860 1588504    0    0  1597  3651 1460 84212 11 23 58  8
 0  1      0 1683696 101220 1621764    0    0  1524  3694 1432 84605 11 23 58  8
 0  1      0 1609016 115756 1655032    0    0  1440  3802 1400 84774 11 23 59  8


So we had some 1.4GB of memory in the "cached" state, so why not shrink
the cache instead?
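
A minimal sketch of how the free and cache figures can be read off vmstat
rows like the ones above. It assumes the 16-column layout shown in the
vmstat header (r b swpd free buff cache si so bi bo in cs us sy id wa) with
memory columns in kB. The helper names are hypothetical, and the two sample
rows are copied from the output above.

```python
# Column names as printed in the vmstat header.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()

# Two consecutive rows copied from the vmstat output in this message.
SAMPLE_ROWS = [
    " 1  6      0   7804  10912 1640332    0    0  4493  6446 1913  9362  6  4 48 43",
    " 0  5      0 391660   6080 1267356    0    0  4265  6404 1902  9483  6  4 46 44",
]


def parse_vmstat_row(line):
    """Turn one vmstat data row into a dict of ints keyed by column name."""
    fields = line.split()
    if len(fields) != len(VMSTAT_COLUMNS):
        raise ValueError(
            f"expected {len(VMSTAT_COLUMNS)} fields, got {len(fields)}"
        )
    return dict(zip(VMSTAT_COLUMNS, map(int, fields)))


def kb_to_gib(kb):
    """Convert a kB figure (as printed by vmstat) to GiB."""
    return kb / (1024 * 1024)


if __name__ == "__main__":
    for row in map(parse_vmstat_row, SAMPLE_ROWS):
        print(f"free={row['free']} kB  cache={kb_to_gib(row['cache']):.2f} GiB")
```

Between these two samples `free` rises by 383856 kB while `cache` falls by
372976 kB.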

I hope we're not going back to 2.4 times, where a lot of people fought
with VM/OOM problems and lost :)


I know this is likely not that helpful and a lot more info is
needed. I'll see if I run into this again, and I will ask exactly which
information is needed to troubleshoot the problem.





-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com





* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  2:23 VM Problems in 2.6.7 (Too active OOM Killer) Peter Zaitsev
@ 2004-07-14  2:40 ` William Lee Irwin III
  2004-07-14  3:20   ` Peter Zaitsev
  2004-07-14  3:17 ` Andrea Arcangeli
  1 sibling, 1 reply; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-14  2:40 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linux-kernel

On Tue, Jul 13, 2004 at 07:23:44PM -0700, Peter Zaitsev wrote:
> To be honest I was truly surprised seeing OOM killer killing MySQL
> without any good reason during highly IO intensive test:
> Out of Memory: Killed process 19301 (mysqld).
> Out of Memory: Killed process 19302 (mysqld).
> Out of Memory: Killed process 19303 (mysqld).
> Out of Memory: Killed process 19304 (mysqld).
> Out of Memory: Killed process 19305 (mysqld).
> Out of Memory: Killed process 19306 (mysqld).
> Out of Memory: Killed process 19309 (mysqld).
> Out of Memory: Killed process 19310 (mysqld).
> Out of Memory: Killed process 19311 (mysqld).
> Out of Memory: Killed process 19312 (mysqld).
> Out of Memory: Killed process 19737 (mysqld).
> Out of Memory: Killed process 19739 (mysqld).
> Out of Memory: Killed process 19821 (mysqld).
> This box has 4G memory and running without swap (what I would need it
> for If I can only use up to 3GB address space in the application anyway)

Is this a regression from earlier 2.6 versions? Do you have an isolated
testcase (obviously I should be able to install mysql easily) I can use
to trigger this?


-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  2:23 VM Problems in 2.6.7 (Too active OOM Killer) Peter Zaitsev
  2004-07-14  2:40 ` William Lee Irwin III
@ 2004-07-14  3:17 ` Andrea Arcangeli
  2004-07-14  3:44   ` Peter Zaitsev
  2004-07-14  3:50   ` William Lee Irwin III
  1 sibling, 2 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2004-07-14  3:17 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linux-kernel

Hello Peter,

On Tue, Jul 13, 2004 at 07:23:44PM -0700, Peter Zaitsev wrote:
> Hi,
> 
> To be honest I was truly surprised seeing OOM killer killing MySQL
> without any good reason during highly IO intensive test:
> 
> Out of Memory: Killed process 19301 (mysqld).
> Out of Memory: Killed process 19302 (mysqld).
> Out of Memory: Killed process 19303 (mysqld).
> Out of Memory: Killed process 19304 (mysqld).
> Out of Memory: Killed process 19305 (mysqld).
> Out of Memory: Killed process 19306 (mysqld).
> Out of Memory: Killed process 19309 (mysqld).
> Out of Memory: Killed process 19310 (mysqld).
> Out of Memory: Killed process 19311 (mysqld).
> Out of Memory: Killed process 19312 (mysqld).
> Out of Memory: Killed process 19737 (mysqld).
> Out of Memory: Killed process 19739 (mysqld).
> Out of Memory: Killed process 19821 (mysqld).

this is a well known 2.6 oom-killer problem w/o swap. Not the worst one,
I mentioned the worst one here just a few weeks ago:
	
	http://groups.google.com/groups?q=g:thl1518647992d&dq=&hl=en&lr=&ie=UTF-8&selm=fa.i50b3kk.p0qsjs%40ifi.uio.no


the only fix at the moment is to use 2.4 with oom killer disabled (the
same issue could happen with 2.4 too). even if it would work better than
the above the oom killer will still get screwed by mlock and it simply
cannot know how much lowmem is freeable leading to deadlock instead of
-ENOMEM with syscalls if you fill the whole lowmem zone.

I fixed everything related to oom in 2.4 some years back, now I need to
port it to 2.6.

workaround is to add swap in 2.6, but in some conditions it'll still
underperform compared to 2.4 due to the lack of the zone-reserve-ratio algo.


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  2:40 ` William Lee Irwin III
@ 2004-07-14  3:20   ` Peter Zaitsev
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zaitsev @ 2004-07-14  3:20 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

On Tue, 2004-07-13 at 19:40, William Lee Irwin III wrote:

> > This box has 4G memory and running without swap (what I would need it
> > for If I can only use up to 3GB address space in the application anyway)
> 
> Is this a regression from earlier 2.6 versions? Do you have an isolated
> testcase (obviously I should be able to install mysql easily) I can use
> to trigger this?

Not yet.  I've been doing various MySQL tests with various kernels and
this is the first time I have seen such behavior.

If it repeats itself I will try to isolate it into a test case;
so far I just wanted to report it to see if this is a known or expected
issue.


-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com





* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  3:17 ` Andrea Arcangeli
@ 2004-07-14  3:44   ` Peter Zaitsev
  2004-07-14  4:10     ` Andrea Arcangeli
  2004-07-14  4:17     ` Andrew Morton
  2004-07-14  3:50   ` William Lee Irwin III
  1 sibling, 2 replies; 33+ messages in thread
From: Peter Zaitsev @ 2004-07-14  3:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

On Tue, 2004-07-13 at 20:17, Andrea Arcangeli wrote:

> > Out of Memory: Killed process 19821 (mysqld).
> 
> this is a well known 2.6 oom-killer problem w/o swap. Not the worst one,
> I mentioned the worst one here just a few weeks ago:

Thanks Andrea,

Your reply is very helpful, as usual.

> 	
> 	http://groups.google.com/groups?q=g:thl1518647992d&dq=&hl=en&lr=&ie=UTF-8&selm=fa.i50b3kk.p0qsjs%40ifi.uio.no
> 
> 
> the only fix at the moment is to use 2.4 with oom killer disabled (the
> same issue could happen with 2.4 too). even if it would work better than
> the above the oom killer will still get screwed by mlock and it simply
> cannot know how much lowmem is freeable leading to deadlock instead of
> -ENOMEM with syscalls if you fill the whole lowmem zone.
> 
> I fixed everything related to oom in 2.4 some years back, now I need to
> port it to 2.6.

When do you think it is going to happen ?

To be honest, I was recently quite happy with 2.6.x stability, and in
2.6.7 IO performance for MySQL workloads seems to be mainly fixed - it
performs well even with the default "as" scheduler.   However, this
problem makes me more cautious once again.

> 
> workaround is to add swap in 2.6, but in some conditions it'll still
> underperform compared to 2.4 due to the lack of the zone-reserve-ratio algo.

The reason I disable swap in both 2.4 and 2.6 is that it really
hurts performance. In some cases performance can be 2-3 times slower
with a swap file enabled.   Using O_DIRECT and mlock() for buffers helps,
but not completely.

RedHat 2.4.x kernels are especially affected.  They seem to love pushing
a lot of memory out to swap, while still keeping a large part of the
swapped-out pages cached. This still negatively affects performance.

-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  3:17 ` Andrea Arcangeli
  2004-07-14  3:44   ` Peter Zaitsev
@ 2004-07-14  3:50   ` William Lee Irwin III
  1 sibling, 0 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-14  3:50 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Peter Zaitsev, linux-kernel

On Wed, Jul 14, 2004 at 05:17:01AM +0200, Andrea Arcangeli wrote:
> this is a well known 2.6 oom-killer problem w/o swap. Not the worst one,
> I mentioned the worst one here just a few weeks ago:
> 	http://groups.google.com/groups?q=g:thl1518647992d&dq=&hl=en&lr=&ie=UTF-8&selm=fa.i50b3kk.p0qsjs%40ifi.uio.no
> the only fix at the moment is to use 2.4 with oom killer disabled (the
> same issue could happen with 2.4 too). even if it would work better than
> the above the oom killer will still get screwed by mlock and it simply
> cannot know how much lowmem is freeable leading to deadlock instead of
> -ENOMEM with syscalls if you fill the whole lowmem zone.
> I fixed everything related to oom in 2.4 some years back, now I need to
> port it to 2.6.
> workaround is to add swap in 2.6, but in some conditions it'll still
> underperform compared to 2.4 due to the lack of the zone-reserve-ratio algo.

Can we try to get a bit more specific? I suspect the reason this stuff
isn't getting much traction is because it's too broad to correlate to
internal kernel problems or the userspace cases that trigger them. I
think once we get that kind of documentation/changelogging we should be
able to get the pieces in.


-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  3:44   ` Peter Zaitsev
@ 2004-07-14  4:10     ` Andrea Arcangeli
  2004-07-14  4:22       ` Andrew Morton
  2004-07-14  4:17     ` Andrew Morton
  1 sibling, 1 reply; 33+ messages in thread
From: Andrea Arcangeli @ 2004-07-14  4:10 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: linux-kernel

On Tue, Jul 13, 2004 at 08:44:02PM -0700, Peter Zaitsev wrote:
> The reason for me to disable swap both in 2.4 and 2.6 is - it really
> hurts performance. In some cases performance can be 2-3 times slower
> with swap file enabled.   Using O_DIRECT and mlock() for buffers helps 
> but not completely.

in 2.4 you can disable swap just fine (with oom killer disabled). until
I/somebody fixes 2.6 you can work around this problem, while still
avoiding swapping much, by setting /proc/sys/vm/swappiness to 0 or similar
to tell the VM "please don't swap" even if swap is enabled ;). That will
still prevent the oom killer from kicking in. The oom killer is forbidden
to run as long as `free` tells you that >= 4k of swap is still available
to the OS. There are no other fundamental vm problems left that I'm aware
of in the latest 2.6 besides these no-swap and oom issues.
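
The condition described above (as stated for 2.6.7 in this message, not as
a rule for other kernels) can be checked from userspace. A minimal sketch,
assuming the `SwapFree:` line of /proc/meminfo is reported in kB;
`swap_free_kb` and `oom_killer_held_off` are hypothetical helper names, and
the 4 kB threshold is taken directly from the text.

```python
def swap_free_kb(meminfo_text):
    """Return the SwapFree value in kB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapFree:"):
            # Line looks like "SwapFree:      1024 kB".
            return int(line.split()[1])
    raise KeyError("SwapFree not found")


def oom_killer_held_off(meminfo_text, threshold_kb=4):
    """True if at least threshold_kb of swap is still free.

    This only mirrors the 2.6.7 behaviour as described in the message;
    it does not query the kernel's actual OOM logic.
    """
    return swap_free_kb(meminfo_text) >= threshold_kb


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        text = f.read()
    print("SwapFree:", swap_free_kb(text), "kB")
    print("OOM killer held off (per the 2.6.7 description):",
          oom_killer_held_off(text))
```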


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  3:44   ` Peter Zaitsev
  2004-07-14  4:10     ` Andrea Arcangeli
@ 2004-07-14  4:17     ` Andrew Morton
  2004-07-14 23:47       ` Peter Zaitsev
  1 sibling, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2004-07-14  4:17 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: andrea, linux-kernel

Peter Zaitsev <peter@mysql.com> wrote:
>
> The reason for me to disable swap both in 2.4 and 2.6 is - it really
>  hurts performance. In some cases performance can be 2-3 times slower
>  with swap file enabled.   Using O_DIRECT and mlock() for buffers helps 
>  but not completely.

It's strange that swap should harm performance in this manner.  Is that
also the case on 2.6?

wrt this OOM problem: it's possible that your ZONE_NORMAL got filled with
anonymous memory which the VM is unable to do anything about.  If you're
going to run a highmem box swapless then you should tune the kernel so that
it doesn't use so much ZONE_NORMAL memory for anonymous pages.

Try

	echo 500 > /proc/sys/vm/lower_zone_protection

then do:

	echo m > /proc/sysrq-trigger; dmesg -c

You'll get output like this:

Normal free:407192kB min:936kB low:1872kB high:2808kB active:1572kB inactive:410348kB present:901120kB
protections[]: 0 468 128724

                     ^^^^^^

This number here means that the VM will make 128724 pages (500MB) of the
normal zone ineligible for anonymous memory and pagecache allocations. 
That is probably an appropriate setting for your application.  If you set
lower_zone_protection too low you'll still get OOMs.  If you set it too
high the file caching efficiency may suffer a little.
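
A rough sketch of the arithmetic behind the 128724-page figure, and of
pulling the numbers out of the two sysrq-m lines quoted above. It assumes
4 KiB pages (the usual size on 32-bit x86, not stated in this message); the
function names are hypothetical and the input strings are copied from the
sample output.

```python
import re

PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB pages, as on 32-bit x86

ZONE_LINE = (
    "Normal free:407192kB min:936kB low:1872kB high:2808kB "
    "active:1572kB inactive:410348kB present:901120kB"
)
PROT_LINE = "protections[]: 0 468 128724"


def pages_to_mib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a page count to MiB."""
    return pages * page_size / (1024 * 1024)


def parse_zone_line(line):
    """Parse 'Normal free:407192kB ...' into (zone_name, {field: kB})."""
    name, rest = line.split(None, 1)
    fields = {key: int(val) for key, val in re.findall(r"(\w+):(\d+)kB", rest)}
    return name, fields


def parse_protections(line):
    """Parse 'protections[]: 0 468 128724' into a list of page counts."""
    _, values = line.split(":", 1)
    return [int(v) for v in values.split()]


if __name__ == "__main__":
    name, fields = parse_zone_line(ZONE_LINE)
    protected_pages = parse_protections(PROT_LINE)[-1]
    print(name, "present:", fields["present"] / 1024, "MiB")
    print("protected:", round(pages_to_mib(protected_pages), 1), "MiB")
```

Under the 4 KiB assumption, 128724 pages come to roughly 503 MiB, which
matches the "500MB" quoted above.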


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  4:10     ` Andrea Arcangeli
@ 2004-07-14  4:22       ` Andrew Morton
  2004-07-14  4:47         ` Andrea Arcangeli
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2004-07-14  4:22 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: peter, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> The oom killer is forbidden to run
>  as long as `free` tells you that >= 4k of swap are still available to the
>  OS.

That code was recently removed, so the OOM killer now kicks in if we run
out of normal zone due to pinned allocations (stack pages, etc).

This promptly caused oom-killings to occur during heavy swapstorms
with laptop_mode=1, which is as yet unfixed.


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  4:22       ` Andrew Morton
@ 2004-07-14  4:47         ` Andrea Arcangeli
  0 siblings, 0 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2004-07-14  4:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: peter, linux-kernel

On Tue, Jul 13, 2004 at 09:22:04PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > The oom killer is forbidden to run
> >  as long as `free` tells you that >= 4k of swap are still available to the
> >  OS.
> 
> That code was recently removed, so the OOM killer now kicks in if we run
> out of normal zone due to pinned allocations (stack pages, etc).

ah, I didn't notice, thanks for the info (I see it in the CVS). however he
was running 2.6.7, so the code I mentioned is still there and in turn my
suggestion will work fine there, since it'll keep the oom killer out of
his way.


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14 23:47       ` Peter Zaitsev
@ 2004-07-14 22:44         ` Andrew Morton
  2004-07-15  0:06           ` Andrea Arcangeli
  2004-07-15  0:30           ` Peter Zaitsev
  2004-07-15  0:04         ` Andrea Arcangeli
  1 sibling, 2 replies; 33+ messages in thread
From: Andrew Morton @ 2004-07-14 22:44 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: andrea, linux-kernel

Peter Zaitsev <peter@mysql.com> wrote:
>
> > 
> > wrt this OOM problem: it's possible that your ZONE_NORMAL got filled with
> > anonymous memory which the VM is unable to do anything about.  If you're
> > going to run a highmem box swapless then you should tune the kernel so that
> > it doesn't use so much ZONE_NORMAL memory for anonymous pages.
> 
> My concern is mainly users which normally run kernel with default
> settings. Things should work for them as well. 

Yes, it could be that the default value of zero for lower_zone_protection
is not appropriate.

> To be honest I do not really understand this OOM without swap problem at
> all, why is it possible to move pages from ZONE_NORMAL to swap but not
> to other zones ? 

If the kernel has no swap there is nothing it can do with an anonymous page
(ie: the thing which malloc() gives you).  It is effectively pinned memory,
because there's nowhere we can write it to get rid of it.

If you end up pinning all of your ZONE_NORMAL pages with anonymous memory,
further GFP_KERNEL allocation attempts will go oom.
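
To illustrate why ZONE_NORMAL is the scarce resource on a 4G box, here is a
sketch that assumes the common 32-bit x86 layout with a 3G/1G user/kernel
split: ZONE_DMA covers the first 16 MiB, ZONE_NORMAL extends up to 896 MiB,
and everything above that is ZONE_HIGHMEM. These boundaries are assumptions
about the default configuration, not something stated in this thread, and
the hypothetical `zone_sizes_mib` helper ignores memory holes.

```python
DMA_LIMIT_MIB = 16       # assumed upper bound of ZONE_DMA on i386
LOWMEM_LIMIT_MIB = 896   # assumed end of lowmem with the default 3G/1G split


def zone_sizes_mib(total_ram_mib):
    """Split a RAM size into DMA / Normal / HighMem sizes, all in MiB."""
    dma = min(total_ram_mib, DMA_LIMIT_MIB)
    normal = max(0, min(total_ram_mib, LOWMEM_LIMIT_MIB) - DMA_LIMIT_MIB)
    highmem = max(0, total_ram_mib - LOWMEM_LIMIT_MIB)
    return {"DMA": dma, "Normal": normal, "HighMem": highmem}


if __name__ == "__main__":
    for ram in (512, 4096, 32768):
        print(ram, "MiB RAM ->", zone_sizes_mib(ram))
```

Under these assumptions a 4G machine has 880 MiB (901120 kB) of ZONE_NORMAL,
so kernel allocations compete for well under a quarter of the RAM no matter
how much highmem is free.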



* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14  4:17     ` Andrew Morton
@ 2004-07-14 23:47       ` Peter Zaitsev
  2004-07-14 22:44         ` Andrew Morton
  2004-07-15  0:04         ` Andrea Arcangeli
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Zaitsev @ 2004-07-14 23:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, linux-kernel

On Tue, 2004-07-13 at 21:17, Andrew Morton wrote:
> Peter Zaitsev <peter@mysql.com> wrote:
> >
> > The reason for me to disable swap both in 2.4 and 2.6 is - it really
> >  hurts performance. In some cases performance can be 2-3 times slower
> >  with swap file enabled.   Using O_DIRECT and mlock() for buffers helps 
> >  but not completely.
> 
> It's strange that swap should harm performance in this manner.  Is that
> also the case on 2.6?

I've run the test and it looks like the VM in 2.6 does not suffer from
excessive swapping, at least for this test. It could be that you need
more memory load to get such an effect (I was only using some 2G out of 4G
for the application; the rest was file cache).

Anyway this is pretty good news.

> 
> wrt this OOM problem: it's possible that your ZONE_NORMAL got filled with
> anonymous memory which the VM is unable to do anything about.  If you're
> going to run a highmem box swapless then you should tune the kernel so that
> it doesn't use so much ZONE_NORMAL memory for anonymous pages.

My concern is mainly users who normally run the kernel with default
settings. Things should work for them as well.

To be honest, I do not really understand this OOM-without-swap problem at
all: why is it possible to move pages from ZONE_NORMAL to swap but not
to other zones?

-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14 23:47       ` Peter Zaitsev
  2004-07-14 22:44         ` Andrew Morton
@ 2004-07-15  0:04         ` Andrea Arcangeli
  2004-07-15  0:43           ` Peter Zaitsev
  2004-07-15  0:43           ` William Lee Irwin III
  1 sibling, 2 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2004-07-15  0:04 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Andrew Morton, linux-kernel

On Wed, Jul 14, 2004 at 04:47:04PM -0700, Peter Zaitsev wrote:
> To be honest I do not really understand this OOM without swap problem at
> all, why is it possible to move pages from ZONE_NORMAL to swap but not
> to other zones ? 

the oom without swap you reproduced is not related to ZONE_NORMAL
shortage. The pages in ZONE_NORMAL never go into swap.

the ZONE_NORMAL oom is a separate issue from the oom killing you
reproduced. with 2.6.7 if you were hitting the ZONE_NORMAL shortage your
machine would lock up and it would never oom-kill anything (Andrew just
changed that in kernel CVS, so thanks to that change a ZONE_NORMAL
shortage will not deadlock anymore in 2.6.8, but OTOH in 2.6.8 adding
swap will not be enough anymore to workaround the oom-killing you
reproduced).

About the ZONE_NORMAL shortage without swap, rather than running
cpu-cache-hungry memcopies from lowmemzone to highmem (or even worse to
pass through swap like it happens in 2.6 mainline with swap enabled), I
believe it's better to reserve some ram in the lowmem zone, 800M of ram
on a 32G box should be a cheap price to pay compared to the cpu/IO cost
involved in moving memory around during the bench.


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14 22:44         ` Andrew Morton
@ 2004-07-15  0:06           ` Andrea Arcangeli
  2004-07-15  0:30           ` Peter Zaitsev
  1 sibling, 0 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2004-07-15  0:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Zaitsev, linux-kernel

On Wed, Jul 14, 2004 at 03:44:27PM -0700, Andrew Morton wrote:
> If you end up pinning all of your ZONE_NORMAL pages with anonymous memory,
> further GFP_KERNEL allocation attempts will go oom.

exactly. Same problem happens with mlock on top of pagecache. (i.e. the
old 2.4 google bug)


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-14 22:44         ` Andrew Morton
  2004-07-15  0:06           ` Andrea Arcangeli
@ 2004-07-15  0:30           ` Peter Zaitsev
  2004-07-15  0:46             ` Andrea Arcangeli
  2004-07-15  1:54             ` William Lee Irwin III
  1 sibling, 2 replies; 33+ messages in thread
From: Peter Zaitsev @ 2004-07-15  0:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, linux-kernel

On Wed, 2004-07-14 at 15:44, Andrew Morton wrote:

> 
> > To be honest I do not really understand this OOM without swap problem at
> > all, why is it possible to move pages from ZONE_NORMAL to swap but not
> > to other zones ? 
> 
> If the kernel has no swap there is nothing it can do with an anonymous page
> (ie: the thing which malloc() gives you).  It is effectively pinned memory,
> because there's nowhere we can write it to get rid of it.

Why can't it be moved to another zone if there is plenty of room there?
In general I was not pushing the system into some kind of stress mode -
there was still a lot of cache memory available. Why could it not be
shrunk instead to accommodate the allocation?

As I understand it, in my case with 4G there is a Normal zone and a
HighMem zone, and "user" anonymous memory can be located in either of
these zones. Is this observation correct?

> 
> If you end up pinning all of your ZONE_NORMAL pages with anonymous memory,
> further GFP_KERNEL allocation attempts will go oom.

Aha, I see. So user-level memory allocations can't cause OOM, only
kernel-level allocations can?   In this case, why do you not have some
reserved amount of space for these types of allocations by default?

In this case I also do not understand how swap space helps here. If you
can't move a page to another zone or shrink the cache because of the
allocation type, how is it that you can nevertheless swap the page out?

-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  0:04         ` Andrea Arcangeli
@ 2004-07-15  0:43           ` Peter Zaitsev
  2004-07-15  0:43           ` William Lee Irwin III
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zaitsev @ 2004-07-15  0:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, linux-kernel

On Wed, 2004-07-14 at 17:04, Andrea Arcangeli wrote:

> the oom without swap you reproduced is not related to ZONE_NORMAL
> shortage. The pages in ZONE_NORMAL never goes into swap.

Hm. It looks like it is now getting even more unclear. If ZONE_NORMAL
pages do not go to swap, what goes there, e.g. on low-memory boxes which
do not have HIGHMEM?

> 
> the ZONE_NORMAL oom is a separate issue from the oom killing you
> reproduced. with 2.6.7 if you were hitting the ZONE_NORMAL shortage your
> machine would lockup and it would never oom-kill anything (Andrew just
> changed that in kernel CVS, so thanks to that change a ZONE_NORMAL
> shortage will not deadlock anymore in 2.6.8, but OTOH in 2.6.8 adding
> swap will not be enough anymore to workaround the oom-killing you
> reproduced).

OOM is better than a lockup but still not good at all.  The problem is
that from the user's standpoint one can control only general memory
allocation; the zones the kernel uses internally are transparent at this
level, so if one gets the OOM killer or a lockup having allocated just,
e.g., 3G out of 4G, this just looks like a bug. Well, if it is documented
I would call it a gotcha.

> 
> About the ZONE_NORMAL shortage without swap, rather than running
> cpu-cache-hungry memcopies from lowmemzone to highmem (or even worse to
> pass through swap like it happens in 2.6 mainline with swap enabled), I
> believe it's better to reserve some ram in the lowmem zone, 800M of ram
> on a 32G box should be a cheap price to pay compared to the cpu/IO cost
> involved in moving memory around during the bench.

Right.  On the other hand, a performance problem is a different class of
problem than lockups and OOM kills.  Poor performance is a much less
critical problem than lockups and the OOM killer firing without good
reason, especially if the user has a way to tune the kernel to improve
performance.

-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  0:04         ` Andrea Arcangeli
  2004-07-15  0:43           ` Peter Zaitsev
@ 2004-07-15  0:43           ` William Lee Irwin III
  2004-07-15  1:04             ` Peter Zaitsev
  1 sibling, 1 reply; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-15  0:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Peter Zaitsev, Andrew Morton, linux-kernel

On Thu, Jul 15, 2004 at 02:04:38AM +0200, Andrea Arcangeli wrote:
> About the ZONE_NORMAL shortage without swap, rather than running
> cpu-cache-hungry memcopies from lowmemzone to highmem (or even worse to
> pass through swap like it happens in 2.6 mainline with swap enabled), I
> believe it's better to reserve some ram in the lowmem zone, 800M of ram
> on a 32G box should be a cheap price to pay compared to the cpu/IO cost
> involved in moving memory around during the bench.

I wouldn't be so quick to dismiss it. There are enough physical
placement issues already even without pinned userspace pages.
Empowering the kernel to remain largely oblivious to physical placement
of userspace pages would be very convenient and eliminate many problems.
Also, given that the alternatives are IO and allocation failure, I
wouldn't be so quick to dismiss it as slow, either. Unfortunately even
page relocation is only a half-measure while mapping->gfp_mask persists.

-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  0:30           ` Peter Zaitsev
@ 2004-07-15  0:46             ` Andrea Arcangeli
  2004-07-15  1:54             ` William Lee Irwin III
  1 sibling, 0 replies; 33+ messages in thread
From: Andrea Arcangeli @ 2004-07-15  0:46 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Andrew Morton, linux-kernel

On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> In this case I also do not understand how swap space helps here ? If you

it helps because you're not in a lowmem shortage condition, adding swap
only prevents the oom killer from being invoked.

If you were in a lowmem shortage condition you would be deadlocking with
2.6.7 AFAIK.


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  0:43           ` William Lee Irwin III
@ 2004-07-15  1:04             ` Peter Zaitsev
  2004-07-15  1:29               ` William Lee Irwin III
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zaitsev @ 2004-07-15  1:04 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrea Arcangeli, Andrew Morton, linux-kernel

On Wed, 2004-07-14 at 17:43, William Lee Irwin III wrote:

> 
> I wouldn't be so quick to dismiss it. There are enough physical
> placement issues already even without pinned userspace pages.

Hi,

You and Andrey are mentioning "pinned" pages. Does this correspond to
locked (ie memlock()) pages ?

In my case I did not have any locked pages at all. 


-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com





* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  1:04             ` Peter Zaitsev
@ 2004-07-15  1:29               ` William Lee Irwin III
  0 siblings, 0 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-15  1:29 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Andrea Arcangeli, Andrew Morton, linux-kernel

On Wed, 2004-07-14 at 17:43, William Lee Irwin III wrote:
>> I wouldn't be so quick to dismiss it. There are enough physical
>> placement issues already even without pinned userspace pages.

On Wed, Jul 14, 2004 at 06:04:45PM -0700, Peter Zaitsev wrote:
> You and Andrey are mentioning "pinned" pages. Does this corresponds to 
> locked (ie memlock()) pages ? 
> In my case I did not have any locked pages at all. 

All anonymous userspace pages are now pinned when there is no swap.


-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  0:30           ` Peter Zaitsev
  2004-07-15  0:46             ` Andrea Arcangeli
@ 2004-07-15  1:54             ` William Lee Irwin III
  2004-07-15  2:13               ` Peter Zaitsev
                                 ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-15  1:54 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Andrew Morton, andrea, linux-kernel

On Wed, 2004-07-14 at 15:44, Andrew Morton wrote:
>> If the kernel has no swap there is nothing it can do with an anonymous page
>> (ie: the thing which malloc() gives you).  It is effectively pinned memory,
>> because there's nowhere we can write it to get rid of it.

On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> Why can't it be moved to other zone if there is a lot of place where ?
> In general I was not pushing system in some kind of stress mode - There
> was still a lot of cache memory available. Why it could not be instead
> shrunk to accommodate allocation ? 

The only method the kernel now has to relocate userspace memory is IO.
When mlocked, or if anonymous when there's no swap, it's pinned.


On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> As I understand in my case with 4G there is  Normal zone and HighMem
> zone where "user" anonymous memory can be located in any of these zones.
> Is this observation correct ? 

Yes.


On Wed, 2004-07-14 at 15:44, Andrew Morton wrote:
>> If you end up pinning all of your ZONE_NORMAL pages with anonymous memory,
>> further GFP_KERNEL allocation attempts will go oom.

On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> Aha I see. So user level memory allocations can't cause OOM only kernel
> level allocations can ?   In this case why do not you have some reserved
> amount of space for these types of allocations by default ? 

Userspace allocations can also trigger OOM, it's merely that in this
case only allocations restricted to ZONE_NORMAL or below, e.g. kernel
allocations, are affected. Your memory pressure is restricted to one zone.


On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> In this case I also do not understand how swap space helps here ? If you
> can't move page to over zone or shrink cache because of allocation type
> how it happens you can however perform page swap ? 

In order to relocate a userspace page, the kernel performs IO to write
the page to some backing store, then lazily faults it back in later. When
the userspace page lacks a backing store, e.g. anonymous pages on
swapless systems, Linux does not now understand how to relocate them.


-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  1:54             ` William Lee Irwin III
@ 2004-07-15  2:13               ` Peter Zaitsev
  2004-07-15  2:33                 ` William Lee Irwin III
  2004-07-18 16:13               ` Kurt Garloff
  2004-07-19 20:21               ` Bill Davidsen
  2 siblings, 1 reply; 33+ messages in thread
From: Peter Zaitsev @ 2004-07-15  2:13 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrew Morton, andrea, linux-kernel

On Wed, 2004-07-14 at 18:54, William Lee Irwin III wrote:

> The only method the kernel now has to relocate userspace memory is IO.
> When mlocked, or if anonymous when there's no swap, it's pinned.

OK. So it is essentially a technical difficulty rather than a fundamental
reason? Why is the "move to other zone" approach not implemented? It
should normally be cheaper than IO.

> On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> > Aha I see. So user level memory allocations can't cause OOM only kernel
> > level allocations can ?   In this case why do not you have some reserved
> > amount of space for these types of allocations by default ? 
> 
> Userspace allocations can also trigger OOM, it's merely that in this
> case only allocations restricted to ZONE_NORMAL or below, e.g. kernel
> allocations, are affected. Your memory pressure is restricted to one zone.

Right. Once it was explained that without swap you have all pages pinned,
it makes sense. On the other hand, why would a user allocation trigger
OOM if there are pages in another zone which can still be used? Or are
there any restrictions on this?

> 
> In order to relocate a userspace page, the kernel performs IO to write
> the page to some backing store, then lazily faults it back in later. When
> the userspace page lacks a backing store, e.g. anonymous pages on
> swapless systems, Linux does not now understand how to relocate them.

Can't it (theoretically) just be moved to another zone with the
appropriate system table modifications?

Well, anyway, it is good to hear that "pinned anonymous" is only an issue
on swapless systems. Together with the fact that the 2.6 VM does not seem
to swap without a good reason, as the 2.4 one did, I can perhaps just
enable a swap file.

-- 
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  2:13               ` Peter Zaitsev
@ 2004-07-15  2:33                 ` William Lee Irwin III
  2004-07-15  2:39                   ` William Lee Irwin III
  2004-07-19 20:27                   ` Bill Davidsen
  0 siblings, 2 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-15  2:33 UTC (permalink / raw)
  To: Peter Zaitsev; +Cc: Andrew Morton, andrea, linux-kernel

On Wed, 2004-07-14 at 18:54, William Lee Irwin III wrote:
>> The only method the kernel now has to relocate userspace memory is IO.
>> When mlocked, or if anonymous when there's no swap, it's pinned.

On Wed, Jul 14, 2004 at 07:13:23PM -0700, Peter Zaitsev wrote:
> OK. So it is essentially a technical difficulty rather than a fundamental
> reason? Why is the "move to other zone" approach not implemented? It
> should normally be cheaper than IO.

There is no technical difficulty; however, do notice there are other forms
of placement-restricted pagecache, e.g. blockdev pagecache, ramdisks, etc.


On Wed, 2004-07-14 at 18:54, William Lee Irwin III wrote:
>> Userspace allocations can also trigger OOM, it's merely that in this
>> case only allocations restricted to ZONE_NORMAL or below, e.g. kernel
>> allocations, are affected. Your memory pressure is restricted to one zone.

On Wed, Jul 14, 2004 at 07:13:23PM -0700, Peter Zaitsev wrote:
> Right. Once it was explained that without swap you have all pages pinned,
> it makes sense. On the other hand, why would a user allocation trigger
> OOM if there are pages in another zone which can still be used? Or are
> there any restrictions on this?

Allocations can be requested to come from restricted physical areas.
In this kind of situation, the OOM comes from exhaustion of a physical
area smaller than all of RAM, i.e. ZONE_NORMAL or ZONE_DMA.

The OOM decision-making is noteworthy:
        do_retry = 0;
        if (!(gfp_mask & __GFP_NORETRY)) {
                if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
                        do_retry = 1;
                if (gfp_mask & __GFP_NOFAIL)
                        do_retry = 1;
        }
        if (do_retry) {
                blk_congestion_wait(WRITE, HZ/50);
                goto rebalance;
        }

At the rebalance label, failure will only be delivered when the
check if (current->flags & (PF_MEMALLOC|PF_MEMDIE)) succeeds; otherwise,
__alloc_pages() retries indefinitely and ignores signals.

Furthermore, notice the OOM killer will trip if out_of_memory() is
called more than 10 times in one second, which is plausible for a
single process to do, as it only sleeps for HZ/50 jiffies. More
interestingly, out_of_memory() is never called unless __GFP_FS is set.


On Wed, 2004-07-14 at 18:54, William Lee Irwin III wrote:
>> In order to relocate a userspace page, the kernel performs IO to write
>> the page to some backing store, then lazily faults it back in later. When
>> the userspace page lacks a backing store, e.g. anonymous pages on
>> swapless systems, Linux does not now understand how to relocate them.

On Wed, Jul 14, 2004 at 07:13:23PM -0700, Peter Zaitsev wrote:
> Can't it (theoretically) just be moved to another zone with the
> appropriate system table modifications?
> Well, anyway, it is good to hear that "pinned anonymous" is only an issue
> on swapless systems. Together with the fact that the 2.6 VM does not seem
> to swap without a good reason, as the 2.4 one did, I can perhaps just
> enable a swap file.

There is no technical (or even practical) obstacle to implementing
in-core page relocation, only a social one: kernel politics. I would not
be surprised if hotplug memory patches already had code usable for this.


-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  2:33                 ` William Lee Irwin III
@ 2004-07-15  2:39                   ` William Lee Irwin III
  2004-07-15  2:44                     ` William Lee Irwin III
  2004-07-19 20:27                   ` Bill Davidsen
  1 sibling, 1 reply; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-15  2:39 UTC (permalink / raw)
  To: Peter Zaitsev, Andrew Morton, andrea, linux-kernel

On Wed, Jul 14, 2004 at 07:33:00PM -0700, William Lee Irwin III wrote:
> At the rebalance label, failure will only be delivered when the
> check if (current->flags & (PF_MEMALLOC|PF_MEMDIE)) succeeds; otherwise,
> __alloc_pages() retries indefinitely and ignores signals.

Careful not to make too much of ignoring signals, mm/oom_kill.c sets
PF_MEMDIE out-of-context, so when an OOM kill is issued while a task
is looping in __alloc_pages() it will eventually break out of the
rebalance loop due to the flag.


-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  2:39                   ` William Lee Irwin III
@ 2004-07-15  2:44                     ` William Lee Irwin III
  2004-08-13 22:23                       ` William Lee Irwin III
  0 siblings, 1 reply; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-15  2:44 UTC (permalink / raw)
  To: Peter Zaitsev, Andrew Morton, andrea, linux-kernel

On Wed, Jul 14, 2004 at 07:39:51PM -0700, William Lee Irwin III wrote:
> Careful not to make too much of ignoring signals, mm/oom_kill.c sets
> PF_MEMDIE out-of-context, so when an OOM kill is issued while a task
> is looping in __alloc_pages() it will eventually break out of the
> rebalance loop due to the flag.

However, note the modifications of task->flags are not atomic. In
principle, one may have:

__alloc_pages()			__oom_kill_task()
load current->flags		load current->flags
|= PF_MEMALLOC in registers	|= PF_MEMALLOC|PF_MEMDIE in registers
IRQ/delay/whatever		store current->flags
store current->flags		...
try_to_free_pages() etc.	force_sig() etc.

... and voila! PF_MEMDIE in ->flags has been lost.

-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  1:54             ` William Lee Irwin III
  2004-07-15  2:13               ` Peter Zaitsev
@ 2004-07-18 16:13               ` Kurt Garloff
  2004-07-20  9:14                 ` R. J. Wysocki
                                   ` (2 more replies)
  2004-07-19 20:21               ` Bill Davidsen
  2 siblings, 3 replies; 33+ messages in thread
From: Kurt Garloff @ 2004-07-18 16:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: William Lee Irwin III, Peter Zaitsev, Andrew Morton, andrea


Hi,

On Wed, Jul 14, 2004 at 06:54:31PM -0700, William Lee Irwin III wrote:
> On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> > Why can't it be moved to other zone if there is a lot of place where ?
> > In general I was not pushing system in some kind of stress mode - There
> > was still a lot of cache memory available. Why it could not be instead
> > shrunk to accommodate allocation ? 
> 
> The only method the kernel now has to relocate userspace memory is IO.

But that could be changed.
If we can swap out and modify the page tables (to mark the page paged
out) and page in to some other location (and modify the pagetables
again), we can as well just copy a page and modify the page tables.

Any fundamental reason why that should not be possible? 

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                            Cologne, DE 
SUSE LINUX AG / Novell, Nuernberg, DE               Director SUSE Labs



* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  1:54             ` William Lee Irwin III
  2004-07-15  2:13               ` Peter Zaitsev
  2004-07-18 16:13               ` Kurt Garloff
@ 2004-07-19 20:21               ` Bill Davidsen
  2 siblings, 0 replies; 33+ messages in thread
From: Bill Davidsen @ 2004-07-19 20:21 UTC (permalink / raw)
  To: linux-kernel

William Lee Irwin III wrote:
> On Wed, 2004-07-14 at 15:44, Andrew Morton wrote:
> 
>>>If the kernel has no swap there is nothing it can do with an anonymous page
>>>(ie: the thing which malloc() gives you).  It is effectively pinned memory,
>>>because there's nowhere we can write it to get rid of it.
> 
> 
> On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> 
>>Why can't it be moved to other zone if there is a lot of place where ?
>>In general I was not pushing system in some kind of stress mode - There
>>was still a lot of cache memory available. Why it could not be instead
>>shrunk to accommodate allocation ? 
> 
> 
> The only method the kernel now has to relocate userspace memory is IO.
> When mlocked, or if anonymous when there's no swap, it's pinned.
> 
> 
> On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> 
>>As I understand in my case with 4G there is  Normal zone and HighMem
>>zone where "user" anonymous memory can be located in any of these zones.
>>Is this observation correct ? 
> 
> 
> Yes.
> 
> 
> On Wed, 2004-07-14 at 15:44, Andrew Morton wrote:
> 
>>>If you end up pinning all of your ZONE_NORMAL pages with anonymous memory,
>>>further GFP_KERNEL allocation attempts will go oom.
> 
> 
> On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> 
>>Aha I see. So user level memory allocations can't cause OOM only kernel
>>level allocations can ?   In this case why do not you have some reserved
>>amount of space for these types of allocations by default ? 
> 
> 
> Userspace allocations can also trigger OOM, it's merely that in this
> case only allocations restricted to ZONE_NORMAL or below, e.g. kernel
> allocations, are affected. Your memory pressure is restricted to one zone.
> 
> 
> On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> 
>>In this case I also do not understand how swap space helps here ? If you
>>can't move page to over zone or shrink cache because of allocation type
>>how it happens you can however perform page swap ? 
> 
> 
> In order to relocate a userspace page, the kernel performs IO to write
> the page to some backing store, then lazily faults it back in later. When
> the userspace page lacks a backing store, e.g. anonymous pages on
> swapless systems, Linux does not now understand how to relocate them.

Can you briefly explain why the obvious method of moving a page from 
point A to point B, both in physical memory, can't be used? Or even the 
less obvious marking of some physical memory as swap space.

Clearly if this was as easy as it looks it would have been done, I just 
don't quite follow why it isn't easy.

And on a related topic, there was a way in 2.4 kernels to designate part
of physical memory as swap. It was really useful if you had one of those
386 chipsets which could only cache 64MB, and more memory than that. It
was long enough ago that I can't remember if that was a feature or a
patch. Actually, it was so long ago it could have been 2.2.xx on that
machine; I just booted it and ran it for years.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  2:33                 ` William Lee Irwin III
  2004-07-15  2:39                   ` William Lee Irwin III
@ 2004-07-19 20:27                   ` Bill Davidsen
  1 sibling, 0 replies; 33+ messages in thread
From: Bill Davidsen @ 2004-07-19 20:27 UTC (permalink / raw)
  To: linux-kernel

William Lee Irwin III wrote:

> There is no technical (or even practical) obstacle to implementing
> in-core page relocation, only a social one: kernel politics. I would not
> be surprised if hotplug memory patches already had code usable for this.

Hopefully that will not prevent them from being put in the kernel :-(

Question: if I create a ramdisk and put a swapfile on that, is it enough
to solve the problem, or does the swap have to be on a real disk?

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-18 16:13               ` Kurt Garloff
@ 2004-07-20  9:14                 ` R. J. Wysocki
  2004-07-20 13:29                 ` Andrea Arcangeli
  2004-07-20 13:29                 ` William Lee Irwin III
  2 siblings, 0 replies; 33+ messages in thread
From: R. J. Wysocki @ 2004-07-20  9:14 UTC (permalink / raw)
  To: Kurt Garloff, linux-kernel
  Cc: William Lee Irwin III, Peter Zaitsev, Andrew Morton, andrea

On Sunday 18 of July 2004 18:13, Kurt Garloff wrote:
> Hi,
>
> On Wed, Jul 14, 2004 at 06:54:31PM -0700, William Lee Irwin III wrote:
> > On Wed, Jul 14, 2004 at 05:30:52PM -0700, Peter Zaitsev wrote:
> > > Why can't it be moved to other zone if there is a lot of place where ?
> > > In general I was not pushing system in some kind of stress mode - There
> > > was still a lot of cache memory available. Why it could not be instead
> > > shrunk to accommodate allocation ?
> >
> > The only method the kernel now has to relocate userspace memory is IO.
>
> But that could be changed.
> If we can swap out and modify the page tables (to mark the page paged
> out) and page in to some other location (and modify the pagetables
> again), we can as well just copy a page and modify the page tables.

Actually we don't need to swap out any pages for this purpose.  It suffices to 
modify the page tables to indicate that certain pages should be _moved_ to 
some other locations, IM(H)O.

Yours,
rjw

-- 
Rafael J. Wysocki
----------------------------
For a successful technology, reality must take precedence over public 
relations, for nature cannot be fooled.
					-- Richard P. Feynman


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-18 16:13               ` Kurt Garloff
  2004-07-20  9:14                 ` R. J. Wysocki
@ 2004-07-20 13:29                 ` Andrea Arcangeli
  2004-07-20 13:53                   ` William Lee Irwin III
  2004-07-20 13:29                 ` William Lee Irwin III
  2 siblings, 1 reply; 33+ messages in thread
From: Andrea Arcangeli @ 2004-07-20 13:29 UTC (permalink / raw)
  To: Kurt Garloff, linux-kernel, William Lee Irwin III, Peter Zaitsev,
	Andrew Morton

On Sun, Jul 18, 2004 at 06:13:38PM +0200, Kurt Garloff wrote:
> Any fundamental reason why that should not be possible? 

of course not, though copying mbytes of data around is expensive, and
relocation is a low priority compared to allocating ram in the right
place with heavy imbalances.


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-18 16:13               ` Kurt Garloff
  2004-07-20  9:14                 ` R. J. Wysocki
  2004-07-20 13:29                 ` Andrea Arcangeli
@ 2004-07-20 13:29                 ` William Lee Irwin III
  2 siblings, 0 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-20 13:29 UTC (permalink / raw)
  To: Kurt Garloff, linux-kernel, Peter Zaitsev, Andrew Morton, andrea

On Wed, Jul 14, 2004 at 06:54:31PM -0700, William Lee Irwin III wrote:
>> The only method the kernel now has to relocate userspace memory is IO.

On Sun, Jul 18, 2004 at 06:13:38PM +0200, Kurt Garloff wrote:
> But that could be changed. If we can swap out and modify the page
> tables (to mark the page paged out) and page in to some other
> location (and modify the pagetables again), we can as well just copy
> a page and modify the page tables.
> Any fundamental reason why that should not be possible? 

No fundamental reasons, no. Just social ones (holy penguin pee).


-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-20 13:29                 ` Andrea Arcangeli
@ 2004-07-20 13:53                   ` William Lee Irwin III
  0 siblings, 0 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-07-20 13:53 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Kurt Garloff, linux-kernel, Peter Zaitsev, Andrew Morton

On Sun, Jul 18, 2004 at 06:13:38PM +0200, Kurt Garloff wrote:
>> Any fundamental reason why that should not be possible? 

On Tue, Jul 20, 2004 at 09:29:16AM -0400, Andrea Arcangeli wrote:
> of course not, though copying mbytes of data around is expensive, and
> relocation is a low priority compared to allocating ram in the right
> place with heavy imbalances.

The bias is good to have also; when it's possible to correctly place
things in advance I like to see that happen. Gracefully recovering
if/when that doesn't work out is all I'd like to have beyond that.


-- wli


* Re: VM Problems in 2.6.7 (Too active OOM Killer)
  2004-07-15  2:44                     ` William Lee Irwin III
@ 2004-08-13 22:23                       ` William Lee Irwin III
  0 siblings, 0 replies; 33+ messages in thread
From: William Lee Irwin III @ 2004-08-13 22:23 UTC (permalink / raw)
  To: Peter Zaitsev, Andrew Morton, andrea, linux-kernel

On Wed, Jul 14, 2004 at 07:39:51PM -0700, William Lee Irwin III wrote:
>> Careful not to make too much of ignoring signals, mm/oom_kill.c sets
>> PF_MEMDIE out-of-context, so when an OOM kill is issued while a task
>> is looping in __alloc_pages() it will eventually break out of the
>> rebalance loop due to the flag.

On Wed, Jul 14, 2004 at 07:44:47PM -0700, William Lee Irwin III wrote:
> However, note the modifications of task->flags are not atomic. In
> principle, one may have:
> __alloc_pages()			__oom_kill_task()
> load current->flags		load current->flags
> |= PF_MEMALLOC in registers	|= PF_MEMALLOC|PF_MEMDIE in registers
> IRQ/delay/whatever		store current->flags
> store current->flags		...
> try_to_free_pages() etc.	force_sig() etc.
> ... and voila! PF_MEMDIE in ->flags has been lost.

I have a testcase that panics in mm/oom_kill.c (no processes left) on
several kinds of machines with weaker memory consistency, but does not
on x86-64. I suspect it is related to lack of task->flags atomicity.


-- wli


Thread overview: 33+ messages
2004-07-14  2:23 VM Problems in 2.6.7 (Too active OOM Killer) Peter Zaitsev
2004-07-14  2:40 ` William Lee Irwin III
2004-07-14  3:20   ` Peter Zaitsev
2004-07-14  3:17 ` Andrea Arcangeli
2004-07-14  3:44   ` Peter Zaitsev
2004-07-14  4:10     ` Andrea Arcangeli
2004-07-14  4:22       ` Andrew Morton
2004-07-14  4:47         ` Andrea Arcangeli
2004-07-14  4:17     ` Andrew Morton
2004-07-14 23:47       ` Peter Zaitsev
2004-07-14 22:44         ` Andrew Morton
2004-07-15  0:06           ` Andrea Arcangeli
2004-07-15  0:30           ` Peter Zaitsev
2004-07-15  0:46             ` Andrea Arcangeli
2004-07-15  1:54             ` William Lee Irwin III
2004-07-15  2:13               ` Peter Zaitsev
2004-07-15  2:33                 ` William Lee Irwin III
2004-07-15  2:39                   ` William Lee Irwin III
2004-07-15  2:44                     ` William Lee Irwin III
2004-08-13 22:23                       ` William Lee Irwin III
2004-07-19 20:27                   ` Bill Davidsen
2004-07-18 16:13               ` Kurt Garloff
2004-07-20  9:14                 ` R. J. Wysocki
2004-07-20 13:29                 ` Andrea Arcangeli
2004-07-20 13:53                   ` William Lee Irwin III
2004-07-20 13:29                 ` William Lee Irwin III
2004-07-19 20:21               ` Bill Davidsen
2004-07-15  0:04         ` Andrea Arcangeli
2004-07-15  0:43           ` Peter Zaitsev
2004-07-15  0:43           ` William Lee Irwin III
2004-07-15  1:04             ` Peter Zaitsev
2004-07-15  1:29               ` William Lee Irwin III
2004-07-14  3:50   ` William Lee Irwin III
