From: Michal Hocko <mhocko@suse.cz>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>
Subject: Re: [PATCH v3] memcg: add nr_pages argument for hierarchical reclaim
Date: Thu, 18 Aug 2011 16:40:45 +0200 [thread overview]
Message-ID: <20110818144045.GA4105@tiehlicka.suse.cz> (raw)
In-Reply-To: <20110818135821.GA16958@cmpxchg.org>
On Thu 18-08-11 15:58:21, Johannes Weiner wrote:
> On Thu, Aug 18, 2011 at 02:57:54PM +0200, Michal Hocko wrote:
> > I have just realized that num_online_nodes should be much better than
> > MAX_NUMNODES.
> > Just for reference, the patch is based on top of
> > https://lkml.org/lkml/2011/8/9/82 (it doesn't depend on it but it also
> > doesn't make much sense without it)
> >
> > Changes since v2:
> > - use num_online_nodes rather than MAX_NUMNODES
> > Changes since v1:
> > - reclaim nr_nodes * SWAP_CLUSTER_MAX in mem_cgroup_force_empty
> > ---
> > From: Michal Hocko <mhocko@suse.cz>
> > Subject: memcg: add nr_pages argument for hierarchical reclaim
> >
> > Now that we are doing memcg direct reclaim limited to nr_to_reclaim
> > pages (introduced by "memcg: stop vmscan when enough done.") we have to
> > be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
> > most callers but it might cause failures for limit resize or force_empty
> > code paths on big NUMA machines.
>
> The limit resizing path retries as long as reclaim makes progress, so
> this is just handwaving.
limit resizing paths do not check the return value of
mem_cgroup_hierarchical_reclaim so the number of retries is not
affected. It is true that fixing that would be much easier.
>
> After Kame's patch, the force-empty path has an increased risk of
> failing to move huge pages to the parent, because it tries reclaim
> only once. This could need further evaluation, and possibly a fix.
Agreed
> But instead:
>
> > @@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m
> > if (!(gfp_mask & __GFP_WAIT))
> > return CHARGE_WOULDBLOCK;
> >
> > + /*
> > + * We are lying about nr_pages because we do not want to
> > + * reclaim too much for THP pages which should rather fallback
> > + * to small pages.
> > + */
> > ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > - gfp_mask, flags, NULL);
> > + gfp_mask, flags, NULL,
> > + 1);
> > if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> > return CHARGE_RETRY;
> > /*
>
> You tell it to reclaim _less_ than before, further increasing the risk
> of failure...
>
> > @@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
> > .may_writepage = !laptop_mode,
> > .may_unmap = 1,
> > .may_swap = !noswap,
> > - .nr_to_reclaim = SWAP_CLUSTER_MAX,
> > + .nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
>
> ...but wait, this transparently fixes it up and ignores the caller's
> request.
>
> Sorry, but this is just horrible!
Yes, I do not like it as well and tried to point it out in the comment.
Anyway I do agree that this doesn't solve the problem you are describing
above and the limit resizing paths can be fixed much easier so the patch
is pointless.
>
> For the past weeks I have been chasing memcg bugs that came in with
> sloppy and untested code, that was merged for handwavy reasons.
Yes, I feel big responsibility about that.
>
> Changes to algorithms need to be tested and optimizations need to be
> quantified in other parts of the VM and the kernel, too. I have no
> idea why this doesn't seem to apply to the memory cgroup subsystem.
Yes, we should definitely do better during review process.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
WARNING: multiple messages have this Message-ID (diff)
From: Michal Hocko <mhocko@suse.cz>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>
Subject: Re: [PATCH v3] memcg: add nr_pages argument for hierarchical reclaim
Date: Thu, 18 Aug 2011 16:40:45 +0200 [thread overview]
Message-ID: <20110818144045.GA4105@tiehlicka.suse.cz> (raw)
In-Reply-To: <20110818135821.GA16958@cmpxchg.org>
On Thu 18-08-11 15:58:21, Johannes Weiner wrote:
> On Thu, Aug 18, 2011 at 02:57:54PM +0200, Michal Hocko wrote:
> > I have just realized that num_online_nodes should be much better than
> > MAX_NUMNODES.
> > Just for reference, the patch is based on top of
> > https://lkml.org/lkml/2011/8/9/82 (it doesn't depend on it but it also
> > doesn't make much sense without it)
> >
> > Changes since v2:
> > - use num_online_nodes rather than MAX_NUMNODES
> > Changes since v1:
> > - reclaim nr_nodes * SWAP_CLUSTER_MAX in mem_cgroup_force_empty
> > ---
> > From: Michal Hocko <mhocko@suse.cz>
> > Subject: memcg: add nr_pages argument for hierarchical reclaim
> >
> > Now that we are doing memcg direct reclaim limited to nr_to_reclaim
> > pages (introduced by "memcg: stop vmscan when enough done.") we have to
> > be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
> > most callers but it might cause failures for limit resize or force_empty
> > code paths on big NUMA machines.
>
> The limit resizing path retries as long as reclaim makes progress, so
> this is just handwaving.
limit resizing paths do not check the return value of
mem_cgroup_hierarchical_reclaim so the number of retries is not
affected. It is true that fixing that would be much easier.
>
> After Kame's patch, the force-empty path has an increased risk of
> failing to move huge pages to the parent, because it tries reclaim
> only once. This could need further evaluation, and possibly a fix.
Agreed
> But instead:
>
> > @@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m
> > if (!(gfp_mask & __GFP_WAIT))
> > return CHARGE_WOULDBLOCK;
> >
> > + /*
> > + * We are lying about nr_pages because we do not want to
> > + * reclaim too much for THP pages which should rather fallback
> > + * to small pages.
> > + */
> > ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > - gfp_mask, flags, NULL);
> > + gfp_mask, flags, NULL,
> > + 1);
> > if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> > return CHARGE_RETRY;
> > /*
>
> You tell it to reclaim _less_ than before, further increasing the risk
> of failure...
>
> > @@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
> > .may_writepage = !laptop_mode,
> > .may_unmap = 1,
> > .may_swap = !noswap,
> > - .nr_to_reclaim = SWAP_CLUSTER_MAX,
> > + .nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
>
> ...but wait, this transparently fixes it up and ignores the caller's
> request.
>
> Sorry, but this is just horrible!
Yes, I do not like it as well and tried to point it out in the comment.
Anyway I do agree that this doesn't solve the problem you are describing
above and the limit resizing paths can be fixed much easier so the patch
is pointless.
>
> For the past weeks I have been chasing memcg bugs that came in with
> sloppy and untested code, that was merged for handwavy reasons.
Yes, I feel big responsibility about that.
>
> Changes to algorithms need to be tested and optimizations need to be
> quantified in other parts of the VM and the kernel, too. I have no
> idea why this doesn't seem to apply to the memory cgroup subsystem.
Yes, we should definitely do better during review process.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2011-08-18 14:40 UTC|newest]
Thread overview: 70+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-08-09 10:04 [PATCH v5 0/6] memg: better numa scanning KAMEZAWA Hiroyuki
2011-08-09 10:04 ` KAMEZAWA Hiroyuki
2011-08-09 10:08 ` [PATCH v5 1/6] " KAMEZAWA Hiroyuki
2011-08-09 10:08 ` KAMEZAWA Hiroyuki
2011-08-10 10:00 ` Michal Hocko
2011-08-10 10:00 ` Michal Hocko
2011-08-10 23:30 ` KAMEZAWA Hiroyuki
2011-08-10 23:30 ` KAMEZAWA Hiroyuki
2011-08-10 23:44 ` [PATCH] memcg: fix comment on update nodemask KAMEZAWA Hiroyuki
2011-08-10 23:44 ` KAMEZAWA Hiroyuki
2011-08-11 13:25 ` Michal Hocko
2011-08-11 13:25 ` Michal Hocko
2011-08-09 10:09 ` [PATCH v5 2/6] memcg: stop vmscan when enough done KAMEZAWA Hiroyuki
2011-08-09 10:09 ` KAMEZAWA Hiroyuki
2011-08-10 14:14 ` Michal Hocko
2011-08-10 14:14 ` Michal Hocko
2011-08-10 23:52 ` KAMEZAWA Hiroyuki
2011-08-10 23:52 ` KAMEZAWA Hiroyuki
2011-08-11 14:50 ` Michal Hocko
2011-08-11 14:50 ` Michal Hocko
2011-08-12 12:44 ` [PATCH] memcg: add nr_pages argument for hierarchical reclaim Michal Hocko
2011-08-12 12:44 ` Michal Hocko
2011-08-17 0:54 ` [PATCH v5 2/6] memcg: stop vmscan when enough done KAMEZAWA Hiroyuki
2011-08-17 0:54 ` KAMEZAWA Hiroyuki
2011-08-17 11:35 ` Michal Hocko
2011-08-17 11:35 ` Michal Hocko
2011-08-17 23:52 ` KAMEZAWA Hiroyuki
2011-08-17 23:52 ` KAMEZAWA Hiroyuki
2011-08-18 6:27 ` Michal Hocko
2011-08-18 6:27 ` Michal Hocko
2011-08-18 6:42 ` KAMEZAWA Hiroyuki
2011-08-18 6:42 ` KAMEZAWA Hiroyuki
2011-08-18 7:46 ` Michal Hocko
2011-08-18 7:46 ` Michal Hocko
2011-08-18 12:57 ` [PATCH v3] memcg: add nr_pages argument for hierarchical reclaim Michal Hocko
2011-08-18 12:57 ` Michal Hocko
2011-08-18 13:58 ` Johannes Weiner
2011-08-18 13:58 ` Johannes Weiner
2011-08-18 14:40 ` Michal Hocko [this message]
2011-08-18 14:40 ` Michal Hocko
2011-08-09 10:10 ` [PATCH v5 3/6] memg: vmscan pass nodemask KAMEZAWA Hiroyuki
2011-08-09 10:10 ` KAMEZAWA Hiroyuki
2011-08-10 11:19 ` Michal Hocko
2011-08-10 11:19 ` Michal Hocko
2011-08-10 23:43 ` KAMEZAWA Hiroyuki
2011-08-10 23:43 ` KAMEZAWA Hiroyuki
2011-08-09 10:11 ` [PATCH v5 4/6] memg: calculate numa weight for vmscan KAMEZAWA Hiroyuki
2011-08-09 10:11 ` KAMEZAWA Hiroyuki
2011-08-17 14:34 ` Michal Hocko
2011-08-17 14:34 ` Michal Hocko
2011-08-18 0:17 ` KAMEZAWA Hiroyuki
2011-08-18 0:17 ` KAMEZAWA Hiroyuki
2011-08-18 8:41 ` Michal Hocko
2011-08-18 8:41 ` Michal Hocko
2011-08-19 0:06 ` KAMEZAWA Hiroyuki
2011-08-19 0:06 ` KAMEZAWA Hiroyuki
2011-08-09 10:12 ` [PATCH v5 5/6] memg: vmscan select victim node by weight KAMEZAWA Hiroyuki
2011-08-09 10:12 ` KAMEZAWA Hiroyuki
2011-08-18 13:34 ` Michal Hocko
2011-08-18 13:34 ` Michal Hocko
2011-08-09 10:13 ` [PATCH v5 6/6] memg: do target scan if unbalanced KAMEZAWA Hiroyuki
2011-08-09 10:13 ` KAMEZAWA Hiroyuki
2011-08-09 14:33 ` [PATCH v5 0/6] memg: better numa scanning Michal Hocko
2011-08-09 14:33 ` Michal Hocko
2011-08-10 0:15 ` KAMEZAWA Hiroyuki
2011-08-10 0:15 ` KAMEZAWA Hiroyuki
2011-08-10 6:03 ` KAMEZAWA Hiroyuki
2011-08-10 6:03 ` KAMEZAWA Hiroyuki
2011-08-10 14:20 ` Michal Hocko
2011-08-10 14:20 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110818144045.GA4105@tiehlicka.suse.cz \
--to=mhocko@suse.cz \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nishimura@mxp.nes.nec.co.jp \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.