From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail6.bemta8.messagelabs.com (mail6.bemta8.messagelabs.com [216.82.243.55]) by kanga.kvack.org (Postfix) with ESMTP id D0EC6900137 for ; Mon, 29 Aug 2011 16:36:58 -0400 (EDT) Received: from wpaz5.hot.corp.google.com (wpaz5.hot.corp.google.com [172.24.198.69]) by smtp-out.google.com with ESMTP id p7TKaqPt005470 for ; Mon, 29 Aug 2011 13:36:53 -0700 Received: from qyk7 (qyk7.prod.google.com [10.241.83.135]) by wpaz5.hot.corp.google.com with ESMTP id p7TKai3j006492 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=NOT) for ; Mon, 29 Aug 2011 13:36:51 -0700 Received: by qyk7 with SMTP id 7so4503969qyk.0 for ; Mon, 29 Aug 2011 13:36:49 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20110829190426.GC1434@cmpxchg.org> References: <1306909519-7286-1-git-send-email-hannes@cmpxchg.org> <1306909519-7286-3-git-send-email-hannes@cmpxchg.org> <20110811210914.GB31229@cmpxchg.org> <20110829190426.GC1434@cmpxchg.org> Date: Mon, 29 Aug 2011 13:36:48 -0700 Message-ID: Subject: Re: [patch 2/8] mm: memcg-aware global reclaim From: Ying Han Content-Type: multipart/alternative; boundary=00163628429eb5e3dc04abaada73 Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: KAMEZAWA Hiroyuki , Daisuke Nishimura , Michal Hocko , Andrew Morton , Rik van Riel , Minchan Kim , KOSAKI Motohiro , Mel Gorman , Greg Thelen , Michel Lespinasse , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins , Balbir Singh --00163628429eb5e3dc04abaada73 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable On Mon, Aug 29, 2011 at 12:04 PM, Johannes Weiner wrote= : > On Mon, Aug 29, 2011 at 12:22:02AM -0700, Ying Han wrote: > > On Mon, Aug 29, 2011 at 12:15 AM, Ying Han wrote: > > > On Thu, Aug 11, 2011 at 2:09 PM, Johannes Weiner > wrote: > > >> > > >> On Thu, Aug 11, 2011 at 01:39:45PM -0700, Ying Han wrote: > > >> > Please consider including the following patch for the next post. I= t > causes > > >> > crash on some of the tests where sc->mem_cgroup is NULL (global > kswapd). > > >> > > > >> > diff --git a/mm/vmscan.c b/mm/vmscan.c > > >> > index b72a844..12ab25d 100644 > > >> > --- a/mm/vmscan.c > > >> > +++ b/mm/vmscan.c > > >> > @@ -2768,7 +2768,8 @@ loop_again: > > >> > * Do some background aging of the anon > list, to > > >> > give > > >> > * pages a chance to be referenced before > > >> > reclaiming. > > >> > */ > > >> > - if (inactive_anon_is_low(zone, &sc)) > > >> > + if (scanning_global_lru(&sc) && > > >> > + inactive_anon_is_low(zone, > &sc)) > > >> > shrink_active_list(SWAP_CLUSTER_MA= X, > zone, > > >> > &sc, > priority, 0); > > >> > > >> Thanks! I completely overlooked this one and only noticed it after > > >> changing the arguments to shrink_active_list(). > > >> > > >> On memcg configurations, scanning_global_lru() will essentially neve= r > > >> be true again, so I moved the anon pre-aging to a separate function > > >> that also does a hierarchy loop to preage the per-memcg anon lists. > > >> > > >> I hope to send out the next revision soon. > > > > > > Also, please consider to fold in the following patch as well. It fixe= s > > > the root cgroup lru accounting and we could easily trigger OOM while > > > doing some swapoff test w/o it. > > > > > > mm:fix the lru accounting for root cgroup. > > > > > > This patch is applied on top of: > > > " > > > mm: memcg-aware global reclaim > > > Signed-off-by: Johannes Weiner > > > " > > > > > > This patch fixes the lru accounting for root cgroup. > > > > > > After the "memcg-aware global reclaim" patch, one of the changes is t= o > have > > > lru pages linked back to root. Under the global memory pressure, we > start from > > > the root cgroup lru and walk through the memcg hierarchy of the syste= m. > For > > > each memcg, we reclaim pages based on the its lru size. > > > > > > However for root cgroup, we used not having a seperate lru and only > counting > > > the pages charged to root as part of root lru size. Without this patc= h, > all > > > the pages which are linked to root lru but not charged to root like > swapcache > > > readahead are not visible to page reclaim code and we are easily to g= et > OOM. > > > > > > After this patch, all the pages linked under root lru are counted in > the lru > > > size, including Used and !Used. > > > > > > Signed-off-by: Hugh Dickins > > > Signed-off-by: Ying Han > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index 5518f54..f6c5f29 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -888,19 +888,21 @@ void mem_cgroup_del_lru_list(struct page *page, > > > enum lru_list lru) > > > { > > > >------struct page_cgroup *pc; > > > >------struct mem_cgroup_per_zone *mz; > > > +>------struct mem_cgroup *mem; > > > =B7 > > > >------if (mem_cgroup_disabled()) > > > >------>-------return; > > > >------pc =3D lookup_page_cgroup(page); > > > ->------/* can happen while we handle swapcache. */ > > > ->------if (!TestClearPageCgroupAcctLRU(pc)) > > > ->------>-------return; > > > ->------VM_BUG_ON(!pc->mem_cgroup); > > > ->------/* > > > ->------ * We don't check PCG_USED bit. It's cleared when the "page" = is > finally > > > ->------ * removed from global LRU. > > > ->------ */ > > > ->------mz =3D page_cgroup_zoneinfo(pc->mem_cgroup, page); > > > + > > > +>------if (TestClearPageCgroupAcctLRU(pc) || PageCgroupUsed(pc)) { > > This PageCgroupUsed part confuses me. A page that is being isolated > shortly after being charged while on the LRU may reach here, and then > it is unaccounted from pc->mem_cgroup, which it never was accounted > to. > > Could you explain why you added it? > To be honest, i don't have very good reason for that. The PageCgroupUsed check is put there after running some tests and some fixes seems help the test, including this one. The one case I can think of for page !AcctLRU | Used is in the pagevec. However, we shouldn't get to the mem_cgroup_del_lru_list() for a page in pagevec at the first place. I now made it so that PageCgroupAcctLRU on the LRU means accounted to > pc->mem_cgroup, this is the same logic currently. > and !PageCgroupAcctLRU on the LRU means accounted to > and babysitted by root_mem_cgroup. this seems to be different from what it is now, especially for swapcache page. So, the page here is linked to root cgroup LRU or not? Anyway, the AcctLRU flags still seems confusing to me: what this flag tells me is that whether or not the page is on a PRIVATE lru and being accounted, i used private here to differentiate from the per zone lru, where it also has PageLRU flag. The two flags are separate since page= s could be on one lru not the other ( I guess ) , but this is changed after having the root cgroup lru back. For example, AcctLRU is used to keep track of the accounted lru pages, especially for root ( we didn't account the !Used pages to root like readahead swapcache). Now we account the full size of lru list of root including Used and !Used, but only mark the Used pages w/ AcctLRU flag. So in general, i am wondering we should be able to replace that eventually with existing Used and LRU bit. Sorry this seems to be something we like t= o consider later, not necessarily now :) Always. Which also means that > before_commit now ensures an LRU page is moved to root_mem_cgroup for > babysitting during the charge, so that concurrent isolations/putbacks > are always accounted correctly. Is this what you had in mind? Did I > miss something? > In my tree, the before->commit->after protocol is folded into one function. I didn't post it since I know you also have patch doing that. So guess I don't understand why we need to move the page to root while it is gonna be charged to a memcg by commit_charge shortly after. My understanding is that in before_commit, we uncharge the page from previous memcg lru if AcctLRU was set, then in the commit_charge we update the new owner of it. And in after_commit we update the memcg lru for the ne= w owner after linking the page in the lru. --Ying --00163628429eb5e3dc04abaada73 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable

On Mon, Aug 29, 2011 at 12:04 PM, Johann= es Weiner <hanne= s@cmpxchg.org> wrote:
On Mon, Aug 29, 2011 at 12:22:02AM -0700, Ying Han wrote:=
> On Mon, Aug 29, 2011 at 12:15 = AM, Ying Han <yinghan@google.com> wrote:
> > On Thu, Aug 11, 2011 at 2:09 PM, Johannes Weiner <
hannes@cmpxchg.org> wrote:
> >>
> >> On Thu, Aug 11, 2011 at 01:39:45PM -0700, Ying Han wrote:
> >> > Please consider including the following patch for the ne= xt post. It causes
> >> > crash on some of the tests where sc->mem_cgroup is NU= LL (global kswapd).
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index b72a844..12ab25d 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -2768,7 +2768,8 @@ loop_again:
> >> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* Do = some background aging of the anon list, to
> >> > give
> >> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0* pag= es a chance to be referenced before
> >> > reclaiming.
> >> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*/ > >> > - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (inacti= ve_anon_is_low(zone, &sc))
> >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (scanni= ng_global_lru(&sc) &&
> >> > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 inactive_anon_is_low(zone, &sc))
> >> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 shrink_active_list(SWAP_CLUSTER_MAX, zone,
> >> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 &sc, priority, = 0);
> >>
> >> Thanks! =A0I completely overlooked this one and only noticed = it after
> >> changing the arguments to shrink_active_list().
> >>
> >> On memcg configurations, scanning_global_lru() will essential= ly never
> >> be true again, so I moved the anon pre-aging to a separate fu= nction
> >> that also does a hierarchy loop to preage the per-memcg anon = lists.
> >>
> >> I hope to send out the next revision soon.
> >
> > Also, please consider to fold in the following patch as well. It = fixes
> > the root cgroup lru accounting and we could easily trigger OOM wh= ile
> > doing some swapoff test w/o it.
> >
> > mm:fix the lru accounting for root cgroup.
> >
> > This patch is applied on top of:
> > "
> > mm: memcg-aware global reclaim
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > "
> >
> > This patch fixes the lru accounting for root cgroup.
> >
> > After the "memcg-aware global reclaim" patch, one of th= e changes is to have
> > lru pages linked back to root. Under the global memory pressure, = we start from
> > the root cgroup lru and walk through the memcg hierarchy of the s= ystem. For
> > each memcg, we reclaim pages based on the its lru size.
> >
> > However for root cgroup, we used not having a seperate lru and on= ly counting
> > the pages charged to root as part of root lru size. Without this = patch, all
> > the pages which are linked to root lru but not charged to root li= ke swapcache
> > readahead are not visible to page reclaim code and we are easily = to get OOM.
> >
> > After this patch, all the pages linked under root lru are counted= in the lru
> > size, including Used and !Used.
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Ying Han <yinghan@google.com>
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 5518f54..f6c5f29 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -888,19 +888,21 @@ void mem_cgroup_del_lru_list(struct page *p= age,
> > enum lru_list lru)
> > =A0{
> > =A0>------struct page_cgroup *pc;
> > =A0>------struct mem_cgroup_per_zone *mz;
> > +>------struct mem_cgroup *mem;
> > =B7
> > =A0>------if (mem_cgroup_disabled())
> > =A0>------>-------return;
> > =A0>------pc =3D lookup_page_cgroup(page);
> > ->------/* can happen while we handle swapcache. */
> > ->------if (!TestClearPageCgroupAcctLRU(pc))
> > ->------>-------return;
> > ->------VM_BUG_ON(!pc->mem_cgroup);
> > ->------/*
> > ->------ * We don't check PCG_USED bit. It's cleared w= hen the "page" is finally
> > ->------ * removed from global LRU.
> > ->------ */
> > ->------mz =3D page_cgroup_zoneinfo(pc->mem_cgroup, page);<= br> > > +
> > +>------if (TestClearPageCgroupAcctLRU(pc) || PageCgroupUsed(p= c)) {

This PageCgroupUsed part confuses me. =A0A page that is being i= solated
shortly after being charged while on the LRU may reach here, and then
it is unaccounted from pc->mem_cgroup, which it never was accounted
to.

Could you explain why you added it?

To = be honest, i don't have very good reason for that. The PageCgroupUsed c= heck is put there after running some tests and some fixes seems help the te= st, including this one.

The one case I can think of for page !AcctLRU | Used is= in the pagevec. However, we shouldn't get to the=A0mem_cgroup_del_lru_= list() for a page in pagevec at the first place.

I now made it so that PageCgroupAcctLRU on the LRU means accounted to
pc->mem_cgroup,

this is the same logic = currently.=A0
=A0
and !PageC= groupAcctLRU on the LRU means accounted to
and babysitted by root_mem_cgroup. =A0

this= seems to be different from what it is now, especially for swapcache page. = So, the page here is linked to root cgroup LRU or not?

Anyway, the AcctLRU flags still seems confusing to me:

<= /div>
what this flag tells me is that whether or not the page is o= n a PRIVATE lru and being accounted, i used private here to differentiate f= rom the per zone lru, where it also has PageLRU flag. =A0The two flags are= =A0separate since pages could be on one lru not the other ( I guess ) , but= this is changed after having the root cgroup lru back. For example,=A0AcctLRU is used to keep track of the accounted lru pages, especiall= y for root ( we didn't account the !Used pages to root like readahead s= wapcache). Now we account the full size of lru list of root including Used = and !Used, but only mark the Used pages w/ AcctLRU flag.=A0

So in general, i am wondering we shoul= d be able to replace that eventually with existing Used and LRU bit. =A0Sor= ry this seems to be something we like to consider later, not necessarily no= w :)

Always. =A0Which also means that
before_commit now ensures an LRU page is moved to root_mem_cgroup for
babysitting during the charge, so that concurrent isolations/putbacks
are always accounted correctly. =A0Is this what you had in mind? =A0Did I miss something?

In my tree, the before-= >commit->after=A0protocol=A0is folded into one function. I didn't= post it since I know you also have patch doing that. =A0So guess I don'= ;t understand why we need to move the page to root while it is gonna be cha= rged to a memcg by commit_charge shortly after.

My understanding is that in before_commit, we uncharge = the page from previous memcg lru if AcctLRU was set, then in the commit_cha= rge we update the new owner of it. And in after_commit we update the memcg = lru for the new owner after linking the page in the lru.=A0

--Ying
=A0

--00163628429eb5e3dc04abaada73-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org