public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Balbir Singh <balbir@linux.vnet.ibm.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Pavel Emelianov <xemul@openvz.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)
Date: Mon, 08 Oct 2007 08:24:20 +0530	[thread overview]
Message-ID: <47099BDC.7080701@linux.vnet.ibm.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0710071758210.13172@blonde.wat.veritas.com>

Hugh Dickins wrote:
> On Fri, 5 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> That's where it should happen, yes; but my point is that it very
>>> often does not.  Because the swap cache page (read in as part of
>>> the readaround cluster of some other cgroup, or in swapoff by some
>>> other cgroup) is already assigned to that other cgroup (by the
>>> mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
>>> page_cgroup exists and the page has already been accounted" route
>>> when mem_cgroup_charge is called from do_swap_page.  Doesn't it?
>>>
>> You are right, at this point I am beginning to wonder if I should
>> account for the swap cache at all? We account for the pages in RSS
>> and when the page comes back into the page table(s) via do_swap_page.
>> If we believe that the swap cache is transitional and the current
>> expected working behaviour does not seem right or hard to fix,
>> it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
>> for accounting and control.
> 
> It would be wrong to ignore the unuse_pte() case: what it's intending
> to do is correct, it's just being prevented by the swapcache issue
> from doing what it intends at present.
> 

OK

> (Though I'm not thrilled with the idea of it causing an admin's
> swapoff to fail because of a cgroup reaching mem limit there, I do
> agree with your earlier argument that that's the right thing to happen,
> and it's up to the admin to fix things up - my original objection came
> from not realizing that normally the cgroup will reclaim from itself
> to free its mem. 

I'm glad we have that sorted out.

Hmm, would the charge fail or the mm get OOM'ed?)
> 

Right now, we OOM if charging and reclaim fails.


> Ignoring add_to/remove_from swap cache is what I've tried before,
> and again today.  It's not enough: if you trying run a memhog
> (something that allocates and touches more memory than the cgroup
> is allowed, relying on pushing out to swap to complete), then that
> works well with the present accounting in add_to/remove_from swap
> cache, but it OOMs once I remove the memcontrol mods from
> mm/swap_state.c.  I keep going back to investigate why, keep on
> thinking I understand it, then later realize I don't.  Please
> give it a try, I hope you've got better mental models than I have.
> 

I will try it. Another way to try it, is to set memory.control_type
to 1, that removes charging of cache pages (both swap cache
and page cache). I just did a quick small test on the memory
controller with swap cache changes disabled and it worked fine
for me on my UML image (without OOMing). I'll try the same test
on a bigger box. Disabling swap does usually cause an
OOM for workloads using anonymous pages if the cgroup goes
over it's limit (since the cgroup cannot pushout memory).

> And I don't think it will be enough to handle shmem/tmpfs either;
> but won't worry about that until we've properly understood why
> exempting swapcache leads to those OOMs, and fixed that up.
> 

Sure.


>>> Are we misunderstanding each other, because I'm assuming
>>> MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
>>> though I can't see that _MAPPED and _CACHED are actually supported,
>>> there being no reference to them outside the enum that defines them.
>> I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
>> discussion. The accounting is split into mem_cgroup_charge() and
>> mem_cgroup_cache_charge(). While charging the caches is when we
>> check for the control_type.
> 
> It checks MEM_CGROUP_TYPE_ALL there, yes; but I can't find anything
> checking for either MEM_CGROUP_TYPE_MAPPED or MEM_CGROUP_TYPE_CACHED.
> (Or is it hidden in one of those preprocesor ## things which frustrate
> both my greps and me!?)
> 

MEM_CGROUP_TYPE_ALL is defined to be (MEM_CGROUP_TYPE_CACHED |
MEM_CGROUP_TYPE_MAPPED). I'll make that more explicit with a patch.
When the type is not MEM_CGROUP_TYPE_ALL, cached pages are ignored.

>>> Or are you deceived by that ifdef NUMA code in swapin_readahead,
>>> which propagates the fantasy that swap allocation follows vma layout?
>>> That nonsense has been around too long, I'll soon be sending a patch
>>> to remove it.
>> The swapin readahead code under #ifdef NUMA is very confusing.
> 
> I sent a patch to linux-mm last night, to remove that confusion.
> 

Thanks, I saw that.

>> I also
>> noticed another confusing thing during my test, swap cache does not
>> drop to 0, even though I've disabled all swap using swapoff. May be
>> those are tmpfs pages. The other interesting thing I tried was running
>> swapoff after a cgroup went over it's limit, the swapoff succeeded,
>> but I see strange numbers for free swap. I'll start another thread
>> after investigating a bit more.
> 
> Those indeed are strange behaviours (if the swapoff really has
> succeeded, rather than lying), I not seen such and don't have an
> explanation.  tmpfs doesn't add any weirdness there: when there's
> no swap, there can be no swap cache.  Or is the swapoff still in
> progress?  While it's busy, we keep /proc/meminfo looking sensible,
> but <Alt><SysRq>m can show negative free swap (IIRC).
> 
> I'll be interested to hear what your investigation shows.
> 

With the new OOM killer changes, I see negative swap. When I run swapoff
with a memory hogger workload, I see (after swapoff succeeds)

....
Swap cache: add 473215, delete 473214, find 31744/36688, race 0+0
Free swap  = 18446744073709105092kB
Total swap = 0kB
Free swap:       -446524kB
...



> Hugh


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

  reply	other threads:[~2007-10-08  2:55 UTC|newest]

Thread overview: 112+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-10-01 21:22 -mm merge plans for 2.6.24 Andrew Morton
2007-10-01 21:34 ` wibbling over the cpuset shed domain connnection Paul Jackson
2007-10-02 12:36   ` Nick Piggin
2007-10-03  5:21     ` Paul Jackson
2007-10-02 13:12       ` Nick Piggin
2007-10-03  7:00         ` Paul Jackson
2007-10-03 10:57           ` Andrew Morton
2007-10-02  4:21 ` Memory controller merge (was Re: -mm merge plans for 2.6.24) Balbir Singh
2007-10-02 15:46   ` Hugh Dickins
2007-10-03  8:13     ` Balbir Singh
2007-10-03 18:47       ` Hugh Dickins
2007-10-04  4:16         ` Balbir Singh
2007-10-04 13:16           ` Hugh Dickins
2007-10-05  3:07             ` Balbir Singh
2007-10-07 17:41               ` Hugh Dickins
2007-10-08  2:54                 ` Balbir Singh [this message]
2007-10-04 16:10     ` Paul Menage
2007-10-10 21:07   ` Rik van Riel
2007-10-11  6:33     ` Balbir Singh
2007-10-02  6:18 ` x86 patches was Re: -mm merge plans for 2.6.24 Andi Kleen
2007-10-02  6:32   ` Andrew Morton
2007-10-02  7:01     ` Andi Kleen
2007-10-02  7:18       ` Andrew Morton
2007-10-02  7:36         ` KAMEZAWA Hiroyuki
2007-10-02  7:43           ` Andrew Morton
2007-10-02  8:16             ` KAMEZAWA Hiroyuki
2007-10-02 10:48               ` Yasunori Goto
2007-10-02 18:18               ` Christoph Lameter
2007-10-02 17:25             ` Lee Schermerhorn
2007-10-02 16:40           ` Nish Aravamudan
2007-10-02 17:17           ` Lee Schermerhorn
2007-10-02 18:16           ` Christoph Lameter
2007-10-02  7:55         ` Matt Mackall
2007-10-02  7:59           ` Andi Kleen
2007-10-02  9:26       ` Andy Whitcroft
2007-10-02  7:37     ` Ingo Molnar
2007-10-02  7:46       ` Andi Kleen
2007-10-02  7:58         ` Thomas Gleixner
2007-10-02  7:59 ` v4l-stk11xx* [Was: -mm merge plans for 2.6.24] Jiri Slaby
     [not found] ` <4701FC79.3060608@gmail.com>
2007-10-02  8:10   ` Wireless damage " Jiri Slaby
2007-10-02  8:17 ` per BDI dirty limit (was Re: -mm merge plans for 2.6.24) Peter Zijlstra
     [not found]   ` <20071002082831.GA19954@mail.ustc.edu.cn>
2007-10-02  8:28     ` Fengguang Wu
2007-10-02  8:31   ` Andrew Morton
2007-10-02  8:48     ` Peter Zijlstra
2007-10-02 10:31       ` Kay Sievers
2007-10-02 10:44         ` Peter Zijlstra
     [not found]           ` <20071002104734.GA9410@mail.ustc.edu.cn>
2007-10-02 10:47             ` Fengguang Wu
2007-10-02 11:22               ` Kay Sievers
     [not found]                 ` <20071002112802.GA12607@mail.ustc.edu.cn>
2007-10-02 11:28                   ` Fengguang Wu
2007-10-02 11:21           ` Kay Sievers
2007-10-02 11:40             ` Peter Zijlstra
2007-10-02 12:05               ` Nick Piggin
2007-10-03 10:15                 ` Kay Sievers
2007-10-03 10:37                   ` Peter Zijlstra
2007-10-03 13:35                     ` Kay Sievers
2007-10-03 13:58                       ` Peter Zijlstra
2007-10-26 14:48                       ` Peter Zijlstra
2007-10-26 15:06                         ` Miklos Szeredi
2007-10-26 15:10                         ` Kay Sievers
2007-10-26 15:22                           ` Peter Zijlstra
2007-10-26 15:33                             ` Kay Sievers
2007-10-26 15:33                               ` Peter Zijlstra
2007-10-26 15:55                                 ` Kay Sievers
2007-10-26 20:04                                   ` Peter Zijlstra
2007-10-27  1:18                                     ` Peter Zijlstra
2007-10-27  2:40                                       ` Greg KH
2007-10-27  8:39                                         ` Peter Zijlstra
2007-10-27 16:02                                           ` Greg KH
2007-10-27 16:07                                             ` Peter Zijlstra
2007-10-27 21:08                                             ` Kay Sievers
2007-10-27 21:35                                               ` Peter Zijlstra
2007-10-28  7:10                                                 ` Greg KH
2007-11-02 13:15                                               ` Peter Zijlstra
2007-11-02 13:50                                                 ` Kay Sievers
2007-11-02 13:54                                                   ` Peter Zijlstra
2007-11-02 14:17                                                   ` Peter Zijlstra
2007-11-02 14:32                                                     ` Kay Sievers
2007-11-02 14:59                                                       ` [PATCH] mm: sysfs: expose the BDI object in sysfs Peter Zijlstra
2007-11-02 15:13                                                         ` Kay Sievers
2007-10-26 16:37                         ` per BDI dirty limit (was Re: -mm merge plans for 2.6.24) Trond Myklebust
2007-12-14 14:50                           ` Peter Zijlstra
2007-12-14 15:14                             ` Miklos Szeredi
2007-12-14 15:54                               ` Peter Zijlstra
2007-10-02 14:38               ` Kay Sievers
2007-10-03 11:00   ` Martin Knoblauch
     [not found] ` <20071002083922.GA28892@mail.ustc.edu.cn>
2007-10-02  8:39   ` writeback fixes Fengguang Wu
2007-10-02 16:06 ` kswapd min order, slub max order [was Re: -mm merge plans for 2.6.24] Hugh Dickins
2007-10-02  9:10   ` Nick Piggin
2007-10-02 18:38   ` Mel Gorman
2007-10-02 18:28     ` Christoph Lameter
2007-10-03  0:37       ` Christoph Lameter
2007-10-02 16:12 ` -mm merge plans for 2.6.24 Pekka Enberg
2007-10-02 16:21 ` new aops merge [was Re: -mm merge plans for 2.6.24] Hugh Dickins
2007-10-02 17:45 ` remove zero_page (was Re: -mm merge plans for 2.6.24) Nick Piggin
2007-10-03 10:58   ` Andrew Morton
2007-10-03 15:21   ` Linus Torvalds
2007-10-08 15:17     ` Nick Piggin
2007-10-09 13:00       ` Hugh Dickins
2007-10-09 14:52       ` Linus Torvalds
2007-10-09  9:31         ` Nick Piggin
2007-10-10  2:22           ` Linus Torvalds
2007-10-09 10:15             ` Nick Piggin
2007-10-10  3:06               ` Linus Torvalds
2007-10-10  4:06               ` Hugh Dickins
2007-10-10  5:20                 ` Linus Torvalds
2007-10-09 14:30                   ` Nick Piggin
2007-10-10 15:04                     ` Linus Torvalds
2007-10-03 19:50 ` A kernel Tracing interface " David Wilder
2007-10-09  9:19 ` r/o bind mounts, was Re: -mm merge plans for 2.6.24 Christoph Hellwig
2007-10-13  8:44 ` Borislav Petkov
2007-10-13  8:52   ` Andrew Morton
2007-10-13 11:45     ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=47099BDC.7080701@linux.vnet.ibm.com \
    --to=balbir@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=hugh@veritas.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=xemul@openvz.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox