Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Balbir Singh <balbir@linux.vnet.ibm.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Pavel Emelianov <xemul@openvz.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)
Date: Mon, 08 Oct 2007 08:24:20 +0530	[thread overview]
Message-ID: <47099BDC.7080701@linux.vnet.ibm.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0710071758210.13172@blonde.wat.veritas.com>

Hugh Dickins wrote:
> On Fri, 5 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> That's where it should happen, yes; but my point is that it very
>>> often does not.  Because the swap cache page (read in as part of
>>> the readaround cluster of some other cgroup, or in swapoff by some
>>> other cgroup) is already assigned to that other cgroup (by the
>>> mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
>>> page_cgroup exists and the page has already been accounted" route
>>> when mem_cgroup_charge is called from do_swap_page.  Doesn't it?
>>>
>> You are right, at this point I am beginning to wonder if I should
>> account for the swap cache at all? We account for the pages in RSS
>> and when the page comes back into the page table(s) via do_swap_page.
>> If we believe that the swap cache is transitional and the current
>> expected working behaviour does not seem right or hard to fix,
>> it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
>> for accounting and control.
> 
> It would be wrong to ignore the unuse_pte() case: what it's intending
> to do is correct, it's just being prevented by the swapcache issue
> from doing what it intends at present.
> 

OK

> (Though I'm not thrilled with the idea of it causing an admin's
> swapoff to fail because of a cgroup reaching mem limit there, I do
> agree with your earlier argument that that's the right thing to happen,
> and it's up to the admin to fix things up - my original objection came
> from not realizing that normally the cgroup will reclaim from itself
> to free its mem. 

I'm glad we have that sorted out.

Hmm, would the charge fail or the mm get OOM'ed?)
> 

Right now, we OOM if charging and reclaim fails.


> Ignoring add_to/remove_from swap cache is what I've tried before,
> and again today.  It's not enough: if you trying run a memhog
> (something that allocates and touches more memory than the cgroup
> is allowed, relying on pushing out to swap to complete), then that
> works well with the present accounting in add_to/remove_from swap
> cache, but it OOMs once I remove the memcontrol mods from
> mm/swap_state.c.  I keep going back to investigate why, keep on
> thinking I understand it, then later realize I don't.  Please
> give it a try, I hope you've got better mental models than I have.
> 

I will try it. Another way to try it, is to set memory.control_type
to 1, that removes charging of cache pages (both swap cache
and page cache). I just did a quick small test on the memory
controller with swap cache changes disabled and it worked fine
for me on my UML image (without OOMing). I'll try the same test
on a bigger box. Disabling swap does usually cause an
OOM for workloads using anonymous pages if the cgroup goes
over it's limit (since the cgroup cannot pushout memory).

> And I don't think it will be enough to handle shmem/tmpfs either;
> but won't worry about that until we've properly understood why
> exempting swapcache leads to those OOMs, and fixed that up.
> 

Sure.


>>> Are we misunderstanding each other, because I'm assuming
>>> MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
>>> though I can't see that _MAPPED and _CACHED are actually supported,
>>> there being no reference to them outside the enum that defines them.
>> I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
>> discussion. The accounting is split into mem_cgroup_charge() and
>> mem_cgroup_cache_charge(). While charging the caches is when we
>> check for the control_type.
> 
> It checks MEM_CGROUP_TYPE_ALL there, yes; but I can't find anything
> checking for either MEM_CGROUP_TYPE_MAPPED or MEM_CGROUP_TYPE_CACHED.
> (Or is it hidden in one of those preprocesor ## things which frustrate
> both my greps and me!?)
> 

MEM_CGROUP_TYPE_ALL is defined to be (MEM_CGROUP_TYPE_CACHED |
MEM_CGROUP_TYPE_MAPPED). I'll make that more explicit with a patch.
When the type is not MEM_CGROUP_TYPE_ALL, cached pages are ignored.

>>> Or are you deceived by that ifdef NUMA code in swapin_readahead,
>>> which propagates the fantasy that swap allocation follows vma layout?
>>> That nonsense has been around too long, I'll soon be sending a patch
>>> to remove it.
>> The swapin readahead code under #ifdef NUMA is very confusing.
> 
> I sent a patch to linux-mm last night, to remove that confusion.
> 

Thanks, I saw that.

>> I also
>> noticed another confusing thing during my test, swap cache does not
>> drop to 0, even though I've disabled all swap using swapoff. May be
>> those are tmpfs pages. The other interesting thing I tried was running
>> swapoff after a cgroup went over it's limit, the swapoff succeeded,
>> but I see strange numbers for free swap. I'll start another thread
>> after investigating a bit more.
> 
> Those indeed are strange behaviours (if the swapoff really has
> succeeded, rather than lying), I not seen such and don't have an
> explanation.  tmpfs doesn't add any weirdness there: when there's
> no swap, there can be no swap cache.  Or is the swapoff still in
> progress?  While it's busy, we keep /proc/meminfo looking sensible,
> but <Alt><SysRq>m can show negative free swap (IIRC).
> 
> I'll be interested to hear what your investigation shows.
> 

With the new OOM killer changes, I see negative swap. When I run swapoff
with a memory hogger workload, I see (after swapoff succeeds)

....
Swap cache: add 473215, delete 473214, find 31744/36688, race 0+0
Free swap  = 18446744073709105092kB
Total swap = 0kB
Free swap:       -446524kB
...



> Hugh


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

WARNING: multiple messages have this Message-ID (diff)

From: Balbir Singh <balbir@linux.vnet.ibm.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Pavel Emelianov <xemul@openvz.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)
Date: Mon, 08 Oct 2007 08:24:20 +0530	[thread overview]
Message-ID: <47099BDC.7080701@linux.vnet.ibm.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0710071758210.13172@blonde.wat.veritas.com>

Hugh Dickins wrote:
> On Fri, 5 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> That's where it should happen, yes; but my point is that it very
>>> often does not.  Because the swap cache page (read in as part of
>>> the readaround cluster of some other cgroup, or in swapoff by some
>>> other cgroup) is already assigned to that other cgroup (by the
>>> mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
>>> page_cgroup exists and the page has already been accounted" route
>>> when mem_cgroup_charge is called from do_swap_page.  Doesn't it?
>>>
>> You are right, at this point I am beginning to wonder if I should
>> account for the swap cache at all? We account for the pages in RSS
>> and when the page comes back into the page table(s) via do_swap_page.
>> If we believe that the swap cache is transitional and the current
>> expected working behaviour does not seem right or hard to fix,
>> it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
>> for accounting and control.
> 
> It would be wrong to ignore the unuse_pte() case: what it's intending
> to do is correct, it's just being prevented by the swapcache issue
> from doing what it intends at present.
> 

OK

> (Though I'm not thrilled with the idea of it causing an admin's
> swapoff to fail because of a cgroup reaching mem limit there, I do
> agree with your earlier argument that that's the right thing to happen,
> and it's up to the admin to fix things up - my original objection came
> from not realizing that normally the cgroup will reclaim from itself
> to free its mem. 

I'm glad we have that sorted out.

Hmm, would the charge fail or the mm get OOM'ed?)
> 

Right now, we OOM if charging and reclaim fails.


> Ignoring add_to/remove_from swap cache is what I've tried before,
> and again today.  It's not enough: if you trying run a memhog
> (something that allocates and touches more memory than the cgroup
> is allowed, relying on pushing out to swap to complete), then that
> works well with the present accounting in add_to/remove_from swap
> cache, but it OOMs once I remove the memcontrol mods from
> mm/swap_state.c.  I keep going back to investigate why, keep on
> thinking I understand it, then later realize I don't.  Please
> give it a try, I hope you've got better mental models than I have.
> 

I will try it. Another way to try it, is to set memory.control_type
to 1, that removes charging of cache pages (both swap cache
and page cache). I just did a quick small test on the memory
controller with swap cache changes disabled and it worked fine
for me on my UML image (without OOMing). I'll try the same test
on a bigger box. Disabling swap does usually cause an
OOM for workloads using anonymous pages if the cgroup goes
over it's limit (since the cgroup cannot pushout memory).

> And I don't think it will be enough to handle shmem/tmpfs either;
> but won't worry about that until we've properly understood why
> exempting swapcache leads to those OOMs, and fixed that up.
> 

Sure.


>>> Are we misunderstanding each other, because I'm assuming
>>> MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
>>> though I can't see that _MAPPED and _CACHED are actually supported,
>>> there being no reference to them outside the enum that defines them.
>> I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
>> discussion. The accounting is split into mem_cgroup_charge() and
>> mem_cgroup_cache_charge(). While charging the caches is when we
>> check for the control_type.
> 
> It checks MEM_CGROUP_TYPE_ALL there, yes; but I can't find anything
> checking for either MEM_CGROUP_TYPE_MAPPED or MEM_CGROUP_TYPE_CACHED.
> (Or is it hidden in one of those preprocesor ## things which frustrate
> both my greps and me!?)
> 

MEM_CGROUP_TYPE_ALL is defined to be (MEM_CGROUP_TYPE_CACHED |
MEM_CGROUP_TYPE_MAPPED). I'll make that more explicit with a patch.
When the type is not MEM_CGROUP_TYPE_ALL, cached pages are ignored.

>>> Or are you deceived by that ifdef NUMA code in swapin_readahead,
>>> which propagates the fantasy that swap allocation follows vma layout?
>>> That nonsense has been around too long, I'll soon be sending a patch
>>> to remove it.
>> The swapin readahead code under #ifdef NUMA is very confusing.
> 
> I sent a patch to linux-mm last night, to remove that confusion.
> 

Thanks, I saw that.

>> I also
>> noticed another confusing thing during my test, swap cache does not
>> drop to 0, even though I've disabled all swap using swapoff. May be
>> those are tmpfs pages. The other interesting thing I tried was running
>> swapoff after a cgroup went over it's limit, the swapoff succeeded,
>> but I see strange numbers for free swap. I'll start another thread
>> after investigating a bit more.
> 
> Those indeed are strange behaviours (if the swapoff really has
> succeeded, rather than lying), I not seen such and don't have an
> explanation.  tmpfs doesn't add any weirdness there: when there's
> no swap, there can be no swap cache.  Or is the swapoff still in
> progress?  While it's busy, we keep /proc/meminfo looking sensible,
> but <Alt><SysRq>m can show negative free swap (IIRC).
> 
> I'll be interested to hear what your investigation shows.
> 

With the new OOM killer changes, I see negative swap. When I run swapoff
with a memory hogger workload, I see (after swapoff succeeds)

....
Swap cache: add 473215, delete 473214, find 31744/36688, race 0+0
Free swap  = 18446744073709105092kB
Total swap = 0kB
Free swap:       -446524kB
...



> Hugh


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2007-10-08  2:55 UTC|newest]

Thread overview: 146+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-10-01 21:22 -mm merge plans for 2.6.24 Andrew Morton
2007-10-01 21:34 ` wibbling over the cpuset shed domain connnection Paul Jackson
2007-10-02 12:36   ` Nick Piggin
2007-10-03  5:21     ` Paul Jackson
2007-10-02 13:12       ` Nick Piggin
2007-10-03  7:00         ` Paul Jackson
2007-10-03 10:57           ` Andrew Morton
2007-10-02  4:21 ` Memory controller merge (was Re: -mm merge plans for 2.6.24) Balbir Singh
2007-10-02  4:21   ` Balbir Singh
2007-10-02 15:46   ` Hugh Dickins
2007-10-02 15:46     ` Hugh Dickins
2007-10-03  8:13     ` Balbir Singh
2007-10-03  8:13       ` Balbir Singh
2007-10-03 18:47       ` Hugh Dickins
2007-10-03 18:47         ` Hugh Dickins
2007-10-04  4:16         ` Balbir Singh
2007-10-04  4:16           ` Balbir Singh
2007-10-04 13:16           ` Hugh Dickins
2007-10-04 13:16             ` Hugh Dickins
2007-10-05  3:07             ` Balbir Singh
2007-10-05  3:07               ` Balbir Singh
2007-10-07 17:41               ` Hugh Dickins
2007-10-07 17:41                 ` Hugh Dickins
2007-10-08  2:54                 ` Balbir Singh [this message]
2007-10-08  2:54                   ` Balbir Singh
2007-10-04 16:10     ` Paul Menage
2007-10-04 16:10       ` Paul Menage
2007-10-10 21:07   ` Rik van Riel
2007-10-10 21:07     ` Rik van Riel
2007-10-11  6:33     ` Balbir Singh
2007-10-11  6:33       ` Balbir Singh
2007-10-02  6:18 ` x86 patches was Re: -mm merge plans for 2.6.24 Andi Kleen
2007-10-02  6:32   ` Andrew Morton
2007-10-02  7:01     ` Andi Kleen
2007-10-02  7:18       ` Andrew Morton
2007-10-02  7:36         ` KAMEZAWA Hiroyuki
2007-10-02  7:43           ` Andrew Morton
2007-10-02  8:16             ` KAMEZAWA Hiroyuki
2007-10-02 10:48               ` Yasunori Goto
2007-10-02 18:18               ` Christoph Lameter
2007-10-02 17:25             ` Lee Schermerhorn
2007-10-02 16:40           ` Nish Aravamudan
2007-10-02 17:17           ` Lee Schermerhorn
2007-10-02 18:16           ` Christoph Lameter
2007-10-02  7:55         ` Matt Mackall
2007-10-02  7:59           ` Andi Kleen
2007-10-02  9:26       ` Andy Whitcroft
2007-10-02  7:37     ` Ingo Molnar
2007-10-02  7:46       ` Andi Kleen
2007-10-02  7:58         ` Thomas Gleixner
2007-10-02  7:59 ` v4l-stk11xx* [Was: -mm merge plans for 2.6.24] Jiri Slaby
2007-10-02  8:08 ` Wireless damage " Jiri Slaby
2007-10-02  8:10   ` Jiri Slaby
2007-10-02 17:37   ` John W. Linville
2007-10-02  8:17 ` per BDI dirty limit (was Re: -mm merge plans for 2.6.24) Peter Zijlstra
2007-10-02  8:28   ` Fengguang Wu
2007-10-02  8:28     ` Fengguang Wu
2007-10-02  8:31   ` Andrew Morton
2007-10-02  8:48     ` Peter Zijlstra
2007-10-02 10:31       ` Kay Sievers
2007-10-02 10:44         ` Peter Zijlstra
2007-10-02 10:47           ` Fengguang Wu
2007-10-02 10:47             ` Fengguang Wu
2007-10-02 11:22               ` Kay Sievers
2007-10-02 11:28                 ` Fengguang Wu
2007-10-02 11:28                   ` Fengguang Wu
2007-10-02 11:21           ` Kay Sievers
2007-10-02 11:40             ` Peter Zijlstra
2007-10-02 12:05               ` Nick Piggin
2007-10-03 10:15                 ` Kay Sievers
2007-10-03 10:37                   ` Peter Zijlstra
2007-10-03 13:35                     ` Kay Sievers
2007-10-03 13:58                       ` Peter Zijlstra
2007-10-26 14:48                       ` Peter Zijlstra
2007-10-26 15:06                         ` Miklos Szeredi
2007-10-26 15:10                         ` Kay Sievers
2007-10-26 15:22                           ` Peter Zijlstra
2007-10-26 15:33                             ` Kay Sievers
2007-10-26 15:33                               ` Peter Zijlstra
2007-10-26 15:55                                 ` Kay Sievers
2007-10-26 20:04                                   ` Peter Zijlstra
2007-10-27  1:18                                     ` Peter Zijlstra
2007-10-27  2:40                                       ` Greg KH
2007-10-27  8:39                                         ` Peter Zijlstra
2007-10-27 16:02                                           ` Greg KH
2007-10-27 16:07                                             ` Peter Zijlstra
2007-10-27 21:08                                             ` Kay Sievers
2007-10-27 21:35                                               ` Peter Zijlstra
2007-10-28  7:10                                                 ` Greg KH
2007-11-02 13:15                                               ` Peter Zijlstra
2007-11-02 13:50                                                 ` Kay Sievers
2007-11-02 13:54                                                   ` Peter Zijlstra
2007-11-02 14:17                                                   ` Peter Zijlstra
2007-11-02 14:32                                                     ` Kay Sievers
2007-11-02 14:59                                                       ` [PATCH] mm: sysfs: expose the BDI object in sysfs Peter Zijlstra
2007-11-02 15:13                                                         ` Kay Sievers
2007-10-26 16:37                         ` per BDI dirty limit (was Re: -mm merge plans for 2.6.24) Trond Myklebust
2007-12-14 14:50                           ` Peter Zijlstra
2007-12-14 15:14                             ` Miklos Szeredi
2007-12-14 15:54                               ` Peter Zijlstra
2007-10-02 14:38               ` Kay Sievers
2007-10-03 11:00   ` Martin Knoblauch
2007-10-02  8:39 ` writeback fixes Fengguang Wu
2007-10-02  8:39   ` Fengguang Wu
2007-10-02 16:06 ` kswapd min order, slub max order [was Re: -mm merge plans for 2.6.24] Hugh Dickins
2007-10-02 16:06   ` Hugh Dickins
2007-10-02  9:10   ` Nick Piggin
2007-10-02  9:10     ` Nick Piggin
2007-10-02 18:38   ` Mel Gorman
2007-10-02 18:38     ` Mel Gorman
2007-10-02 18:28     ` Christoph Lameter
2007-10-02 18:28       ` Christoph Lameter
2007-10-03  0:37       ` Christoph Lameter
2007-10-03  0:37         ` Christoph Lameter
2007-10-02 16:12 ` -mm merge plans for 2.6.24 Pekka Enberg
2007-10-02 16:21 ` new aops merge [was Re: -mm merge plans for 2.6.24] Hugh Dickins
2007-10-02 16:21   ` Hugh Dickins
2007-10-02 17:45 ` remove zero_page (was Re: -mm merge plans for 2.6.24) Nick Piggin
2007-10-02 17:45   ` Nick Piggin
2007-10-03 10:58   ` Andrew Morton
2007-10-03 10:58     ` Andrew Morton
2007-10-03 15:21   ` Linus Torvalds
2007-10-03 15:21     ` Linus Torvalds
2007-10-08 15:17     ` Nick Piggin
2007-10-08 15:17       ` Nick Piggin
2007-10-09 13:00       ` Hugh Dickins
2007-10-09 13:00         ` Hugh Dickins
2007-10-09 14:52       ` Linus Torvalds
2007-10-09 14:52         ` Linus Torvalds
2007-10-09  9:31         ` Nick Piggin
2007-10-09  9:31           ` Nick Piggin
2007-10-10  2:22           ` Linus Torvalds
2007-10-10  2:22             ` Linus Torvalds
2007-10-09 10:15             ` Nick Piggin
2007-10-09 10:15               ` Nick Piggin
2007-10-10  3:06               ` Linus Torvalds
2007-10-10  3:06                 ` Linus Torvalds
2007-10-10  4:06               ` Hugh Dickins
2007-10-10  4:06                 ` Hugh Dickins
2007-10-10  5:20                 ` Linus Torvalds
2007-10-10  5:20                   ` Linus Torvalds
2007-10-09 14:30                   ` Nick Piggin
2007-10-09 14:30                     ` Nick Piggin
2007-10-10 15:04                     ` Linus Torvalds
2007-10-10 15:04                       ` Linus Torvalds
2007-10-03 19:50 ` A kernel Tracing interface " David Wilder
2007-10-09  9:19 ` r/o bind mounts, was Re: -mm merge plans for 2.6.24 Christoph Hellwig
2007-10-13  8:44 ` Borislav Petkov
2007-10-13  8:52   ` Andrew Morton
2007-10-13 11:45     ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=47099BDC.7080701@linux.vnet.ibm.com \
    --to=balbir@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=hugh@veritas.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=xemul@openvz.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.