From: Balbir Singh <balbir@linux.vnet.ibm.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Pavel Emelianov <xemul@openvz.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)
Date: Fri, 05 Oct 2007 08:37:37 +0530
Message-ID: <4705AA79.9080008@linux.vnet.ibm.com>
In-Reply-To: <Pine.LNX.4.64.0710041258530.3485@blonde.wat.veritas.com>
Hugh Dickins wrote:
> On Thu, 4 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> Well, swap control is another subject. I guess for that you'll need
>>> to track which cgroup each swap page belongs to (rather more expensive
>>> than the current swap_map of unsigned shorts). And I doubt it'll be
>>> swap control as such that's required, but control of rss+swap.
>> I see what you mean now, other people have recommended a per cgroup
>> swap file/device.
>
> Sounds too inflexible, and too many swap areas to me. Perhaps the
> right answer will fall in between: assign clusters of swap pages to
> different cgroups as needed. But worry about that some other time.
>
Yes, depending on the number of cgroups, we'll need to share swap
areas between them. That will require more work and thought.

>>> But here I'm just worrying about how the existence of swap makes
>>> something of a nonsense of your rss control.
>>>
>> Ideally, pages would not reside for too long in swap cache (unless
>
> Thinking particularly of those brought in by swapoff or swap readahead:
> some will get attached to mms once accessed, others will simply get
> freed when tasks exit or munmap, others will hang around until they
> reach the bottom of the LRU and are reclaimed again by memory pressure.
>
> But as your code stands, that'll be total memory pressure: in-cgroup
> memory pressure will tend to miss them, since typically they're
> assigned to the wrong cgroup; until then their presence is liable
> to cause other pages to be reclaimed which ideally should not be.
>
Pressure within the cgroup that actually uses them will not affect
them, since they are charged to a different cgroup. If there is
pressure in the cgroup to which they are wrongly assigned, they would
get reclaimed first.

>> I've misunderstood swap cache or there are special cases for tmpfs/
>> ramfs).
>
> ramfs pages are always in RAM, never go out to swap, no need to
> worry about them in this regard. But tmpfs pages can indeed go
> out to swap, so whatever we come up with needs to make sense
> with them too, yes. I don't think its swapoff/readahead issues
> are any harder to handle than the anonymous mapped page case,
> but it will need its own code to handle them.
>
>> Once pages have been swapped back in, they get assigned
>> back to their respective cgroup's in do_swap_page() (where we charge
>> them back to the cgroup).
>>
>
> That's where it should happen, yes; but my point is that it very
> often does not. Because the swap cache page (read in as part of
> the readaround cluster of some other cgroup, or in swapoff by some
> other cgroup) is already assigned to that other cgroup (by the
> mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
> page_cgroup exists and the page has already been accounted" route
> when mem_cgroup_charge is called from do_swap_page. Doesn't it?
>
You are right. At this point I am beginning to wonder whether I should
account for the swap cache at all. We already account for pages in RSS,
and again when a page comes back into the page table(s) via
do_swap_page(). If we believe that the swap cache is transitional, and
the current behaviour seems wrong or hard to fix, it might be easiest
to ignore unuse_pte() and add/remove_from_swap_cache() for accounting
and control.

The current behaviour of the memory controller is that, as you point
out, several pages get accounted to the cgroup that initiates swapin
readahead or swapoff. When that cgroup (the one that initiated the
swapin or swapoff) comes under pressure, it would discard these pages
first. The pages are discarded from the cgroup, but still live on the
global LRU. When the original cgroup is under pressure, these pages
might not be affected, as they belong to a different cgroup, which
might not be under any sort of pressure.

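
As a rough illustration of the behaviour described above, here is a toy
userspace model. It is not kernel code: the toy_* types and functions
are made up for this sketch and only mimic the shape of the charge
paths discussed in this thread. It assumes that a page already linked
to a cgroup is never re-charged, which is the "already accounted" route
Hugh describes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not the kernel's. */
struct toy_cgroup {
	const char *name;
	long usage;			/* pages currently charged */
};

struct toy_page {
	struct toy_cgroup *owner;	/* stands in for the page_cgroup link */
};

/*
 * Charge pg to cg unless it is already accounted.
 * Returns 1 if a new charge was made, 0 if the page was already charged.
 */
static int toy_charge(struct toy_page *pg, struct toy_cgroup *cg)
{
	assert(pg != NULL && cg != NULL);
	if (pg->owner != NULL)
		return 0;		/* "already accounted" route */
	pg->owner = cg;
	cg->usage++;
	return 1;
}

/* Readahead by `reader`: the whole cluster is charged to the reader. */
static void toy_swapin_readahead(struct toy_cgroup *reader,
				 struct toy_page *cluster, size_t n)
{
	for (size_t i = 0; i < n; i++)
		toy_charge(&cluster[i], reader);
}

/* Models the charge attempted when `faulter` later maps the page. */
static int toy_swap_fault(struct toy_cgroup *faulter, struct toy_page *pg)
{
	return toy_charge(pg, faulter);
}
```

If cgroup A reads a four-page cluster and cgroup B then faults on one
of those pages, toy_swap_fault() returns 0 and the page stays charged
to A, so pressure inside B never sees it.
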
> Are we misunderstanding each other, because I'm assuming
> MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
> though I can't see that _MAPPED and _CACHED are actually supported,
> there being no reference to them outside the enum that defines them.
>
I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
discussion. The accounting is split into mem_cgroup_charge() and
mem_cgroup_cache_charge(). The control_type is checked only when
charging the caches.

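
A minimal sketch of that split, again as a toy userspace model rather
than the actual patch: the toy_* names are hypothetical, and the sketch
assumes the cache path charges only when the control type is "all",
while the mapped path charges unconditionally.

```c
/* Toy model only: mirrors the enum names from this thread with a toy prefix. */
enum toy_control_type {
	TOY_TYPE_MAPPED,
	TOY_TYPE_CACHED,
	TOY_TYPE_ALL,
};

struct toy_mem_cgroup {
	enum toy_control_type control_type;
	long usage;			/* pages currently charged */
};

/* Mapped-page path: always charges, never looks at control_type. */
static void toy_mem_charge(struct toy_mem_cgroup *mem)
{
	mem->usage++;
}

/*
 * Cache path: the only place control_type is consulted (assumption).
 * Returns 1 if the page was charged, 0 if it was skipped.
 */
static int toy_mem_cache_charge(struct toy_mem_cgroup *mem)
{
	if (mem->control_type != TOY_TYPE_ALL)
		return 0;
	toy_mem_charge(mem);
	return 1;
}
```

With control_type set to TOY_TYPE_ALL, cache pages (swap cache
included) are charged; with any other value only the mapped path adds
to usage.
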
> Or are you deceived by that ifdef NUMA code in swapin_readahead,
> which propagates the fantasy that swap allocation follows vma layout?
> That nonsense has been around too long, I'll soon be sending a patch
> to remove it.
>
The swapin readahead code under #ifdef NUMA is very confusing. I also
noticed another confusing thing during my tests: swap cache does not
drop to 0, even though I've disabled all swap using swapoff. Maybe
those are tmpfs pages. The other interesting thing I tried was running
swapoff after a cgroup went over its limit. The swapoff succeeded,
but I see strange numbers for free swap. I'll start another thread
after investigating a bit more.

>> The swap cache pages will be the first ones to go, once the cgroup
>> exceeds its limit.
>
> No, because they're (in general) booked to the wrong cgroup.
>
I meant the wrong cgroup: in the cgroup they are wrongly charged to,
these will be the first set of pages to be reclaimed.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL